A comparison study on feature selection of DNA structural properties for promoter prediction

Background: Promoter prediction is an integral step in understanding gene regulation and annotating genomes. Traditional promoter analysis is based mainly on sequence compositional features. Recently, many kinds of structural features have also been employed in promoter prediction. However, because of high dimensionality and the risk of overfitting, it is infeasible to use all available features for promoter prediction. It is therefore necessary to select appropriate features for the prediction task.

Results: This paper presents an extensive comparison study on feature selection of DNA structural properties for promoter prediction. First, to examine whether promoters possess distinctive structures, we carry out a systematic comparison of the profiles of thirteen structural features on promoter and non-promoter sequences. Second, we investigate the correlations between these structural features and promoter sequences. Third, both filter and wrapper methods are used to select appropriate feature subsets from the thirteen kinds of structural features, and the predictive power of the selected feature subsets is evaluated. Finally, we compare the prediction performance of the feature subsets selected in this paper with that of nine existing promoter prediction approaches.

Conclusions: Experimental results show that the structural features are differentially correlated with promoters. Specifically, DNA-bending stiffness, DNA denaturation and energy-related features are highly correlated with promoters. The predictive power for promoter sequences differs greatly among the structural features. Selecting the relevant features can significantly improve the accuracy of promoter prediction.
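The difference between the filter and wrapper strategies named in the Results can be illustrated with a minimal sketch. The abstract does not say which filter criterion, search procedure, or classifier the authors used, so everything here is an assumption made for illustration: the filter ranks features by absolute Pearson correlation with the class label, the wrapper is a greedy forward search scored by a nearest-centroid classifier, and the data are synthetic (four hypothetical features, of which features 0 and 2 carry signal), not DNA structural profiles.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def filter_rank(X, y):
    """Filter: score each feature on its own, independent of any classifier."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def centroid_accuracy(X_train, y_train, X_test, y_test, features):
    """Accuracy of a nearest-centroid classifier restricted to `features`."""
    centroids = {}
    for label in (0, 1):
        rows = [r for r, t in zip(X_train, y_train) if t == label]
        centroids[label] = [mean(r[j] for r in rows) for j in features]
    correct = 0
    for row, target in zip(X_test, y_test):
        dist = {c: sum((row[j] - m) ** 2 for j, m in zip(features, cen))
                for c, cen in centroids.items()}
        correct += min(dist, key=dist.get) == target
    return correct / len(y_test)


def wrapper_forward(X_train, y_train, X_val, y_val, max_features):
    """Wrapper: greedily add the feature that most improves validation accuracy."""
    selected, best = [], 0.0
    remaining = list(range(len(X_train[0])))
    while remaining and len(selected) < max_features:
        scored = [(centroid_accuracy(X_train, y_train, X_val, y_val,
                                     selected + [j]), j) for j in remaining]
        acc, j = max(scored)
        if acc <= best:  # stop when no candidate improves the subset
            break
        selected.append(j)
        remaining.remove(j)
        best = acc
    return selected, best


def make_data(n):
    """Synthetic two-class data: features 0 and 2 informative, 1 and 3 noise."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([2.0 * label + random.gauss(0, 0.5),
                  random.gauss(0, 1),
                  -1.5 * label + random.gauss(0, 0.5),
                  random.gauss(0, 1)])
        y.append(label)
    return X, y


random.seed(0)
X_train, y_train = make_data(200)
X_val, y_val = make_data(100)

ranking = filter_rank(X_train, y_train)
selected, val_acc = wrapper_forward(X_train, y_train, X_val, y_val, max_features=2)
print("filter ranking:", ranking)
print("wrapper subset:", selected, "validation accuracy:", val_acc)
```

The filter scores each feature once and is cheap, but it ignores interactions between features. The wrapper retrains a classifier for every candidate subset, which costs more but evaluates features in the context of the model that will use them.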
