EnsemPro: an ensemble approach to predicting transcription start sites in human genomic DNA sequences.

Although several computational methods have been developed to identify transcription start sites (TSSs)/promoters, their prediction accuracy still needs improvement. Because of this low performance, promoter prediction programs can give misleading results in functional genomic studies. To improve prediction accuracy, we propose an ensemble approach, EnsemPro (Ensemble Promoter), which combines the prediction results of existing promoter predictors. We systematically compared the performance of currently available promoter prediction programs in an identical evaluation environment, and the results served as a guide for choosing which predictors to combine. We applied three representative ensemble schemes (majority voting, weighted voting, and a Bayesian approach) to TSS prediction in hundreds of human genomic sequences. EnsemPro identified TSSs more precisely than the other combining methods and the currently available individual predictor programs. The source code of EnsemPro is available on request from the authors.
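
The three ensemble schemes named above can be illustrated with a minimal Python sketch. This is not the EnsemPro source code, which the abstract does not describe in detail. It assumes each predictor casts a binary vote (1 = TSS, 0 = no TSS) on a candidate region, that weights reflect some measure of each predictor's reliability, and that the Bayesian scheme is a naive Bayes combination with independent predictor errors. All function and parameter names are hypothetical.

```python
from math import exp, log


def majority_vote(votes):
    """Return True if more than half of the predictors vote for a TSS."""
    return sum(votes) * 2 > len(votes)


def weighted_vote(votes, weights):
    """Return True if the weight behind positive votes exceeds half the total weight.

    The weights are an assumption of this sketch, e.g. derived from each
    predictor's accuracy on a benchmark set.
    """
    positive = sum(w for v, w in zip(votes, weights) if v)
    return positive * 2 > sum(weights)


def naive_bayes_posterior(votes, sensitivities, false_positive_rates, prior):
    """Return P(TSS | votes), assuming the predictors err independently.

    sensitivities[i] is P(vote_i = 1 | TSS); false_positive_rates[i] is
    P(vote_i = 1 | no TSS). Both are assumed to be estimated beforehand.
    """
    log_odds = log(prior / (1 - prior))
    for v, sens, fpr in zip(votes, sensitivities, false_positive_rates):
        if v:
            log_odds += log(sens / fpr)
        else:
            log_odds += log((1 - sens) / (1 - fpr))
    return 1 / (1 + exp(-log_odds))


if __name__ == "__main__":
    votes = [1, 1, 0]  # hypothetical votes from three predictors
    print(majority_vote(votes))
    print(weighted_vote(votes, [3, 1, 1]))
    print(naive_bayes_posterior(votes, [0.8, 0.7, 0.6], [0.2, 0.3, 0.4], 0.5))
```

Majority voting treats all predictors equally, weighted voting lets a more reliable predictor outvote weaker ones, and the Bayesian combination returns a probability that can be thresholded to trade sensitivity against precision.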
