Prediction of plant promoters based on hexamers and random triplet pair analysis

Background: With an increasing number of plant genome sequences available, it has become important to develop robust computational methods for detecting plant promoters. Although a wide variety of programs are currently available, their prediction accuracy still requires improvement. These limitations can be addressed by selecting appropriate features for distinguishing promoters from non-promoters.

Methods: In this study, we propose two feature selection approaches based on hexamer sequences: the Frequency Distribution Analyzed Feature Selection Algorithm (FDAFSA) and the Random Triplet Pair Feature Selecting Genetic Algorithm (RTPFSGA). In FDAFSA, adjacent triplet pairs (hexamer sequences) are selected based on the difference in hexamer frequency between promoters and non-promoters. In RTPFSGA, random triplet pairs (RTPs) are selected by a genetic algorithm that exploits differences in the frequencies of non-adjacent triplet pairs between promoters and non-promoters. The features selected by the two approaches are then combined and passed to a support vector machine (SVM), a nonlinear machine-learning algorithm, which classifies sequences as promoters or non-promoters. We refer to this novel algorithm as PromoBot.

Results: Promoter sequences were collected from the PlantProm database. Non-promoter sequences were collected from plant mRNA, rRNA, and tRNA in PlantGDB and from plant miRNA in miRBase. To validate the proposed algorithm, we applied 5-fold cross-validation. In each fold, the training data were used to select features with FDAFSA and RTPFSGA, and these features were used to train the SVM. We achieved 89% sensitivity and 86% specificity.

Conclusions: We compared PromoBot with five other algorithms. Its sensitivity and specificity were comparable to, or better than, those of the algorithms tested. These results show that the two proposed feature selection methods, based on hexamer frequencies and random triplet pairs, can be successfully incorporated into a supervised machine-learning method for promoter classification. We therefore expect that PromoBot can help identify new plant promoters. Source code and analysis results are available upon request.
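The two feature types described in the Methods can be illustrated with a minimal Python sketch. It is not the PromoBot implementation: the function names, the overlapping-window counting, the absolute-difference ranking, and the fixed `gap` parameter are all assumptions made for illustration, and the genetic algorithm search and the SVM training are omitted. The last function only restates the standard definitions of sensitivity and specificity.

```python
from collections import Counter


def hexamer_frequencies(sequences):
    """Relative frequency of each hexamer, counted over overlapping windows
    (overlapping windows are an assumption, not stated in the abstract)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()} if total else {}


def select_hexamers(promoters, non_promoters, top_n):
    """Rank hexamers by absolute frequency difference between the two classes.
    This ranking rule is an assumed stand-in for the FDAFSA criterion."""
    fp = hexamer_frequencies(promoters)
    fn = hexamer_frequencies(non_promoters)
    diff = {h: abs(fp.get(h, 0.0) - fn.get(h, 0.0)) for h in set(fp) | set(fn)}
    # Sort by descending difference; break ties alphabetically for determinism.
    return sorted(diff, key=lambda h: (-diff[h], h))[:top_n]


def triplet_pair_count(seq, first, second, gap):
    """Count positions where triplet `first` is followed by triplet `second`
    with `gap` bases in between (gap=0 reduces to an ordinary hexamer)."""
    seq = seq.upper()
    span = 6 + gap
    return sum(
        1
        for i in range(len(seq) - span + 1)
        if seq[i:i + 3] == first and seq[i + 3 + gap:i + span] == second
    )


def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)
```

In a pipeline like the one described, the selected hexamers and triplet pairs would each contribute one frequency or count per sequence to the feature vector given to the classifier. Selection would be repeated inside each cross-validation fold using only that fold's training data.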
