Benchmarking available bacterial promoter prediction tools: potentialities and limitations

Background The promoter region is a key element required for the production of RNA in bacteria. While new high-throughput technologies allow massive mapping of promoter elements, we still rely mainly on bioinformatic tools to predict such elements in bacterial genomes. Additionally, although many different prediction tools have become popular for identifying bacterial promoters, there has been no systematic comparison of these tools.

Results Here, we performed a systematic comparison of several widely used promoter prediction tools (BPROM, bTSSfinder, BacPP, CNNProm, IBBP, Virtual Footprint, iPro70-FMWin, 70ProPred, iPromoter-2L and MULTiPly) using well-defined sequence datasets and standardized metrics to determine how well the tools performed relative to each other. For this, we used datasets of experimentally validated promoters from Escherichia coli and a control dataset composed of randomly generated sequences with similar nucleotide distributions. We compared the performance of the tools using metrics such as specificity, sensitivity, accuracy and the Matthews Correlation Coefficient (MCC). We show that the widely used BPROM presented the worst performance among the compared tools, while four tools (CNNProm, iPro70-FMWin, 70ProPred and iPromoter-2L) offered high predictive power. Of these, iPro70-FMWin exhibited the best results for most of the metrics used.

Conclusions We therefore explore here some of the potential and limitations of the available tools, and we hope that future work can build upon our effort to systematically characterize this useful class of bioinformatics tools.
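As an illustration of the metrics named above, the following minimal sketch computes sensitivity, specificity, accuracy and MCC from the four counts of a binary confusion matrix using their standard definitions. The function name `confusion_metrics` and the example counts are hypothetical and are not taken from the study; here a "positive" is assumed to be a sequence the tool labels as a promoter.

```python
import math


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute standard binary-classification metrics from raw counts.

    tp: validated promoters predicted as promoters
    fn: validated promoters predicted as non-promoters
    tn: control sequences predicted as non-promoters
    fp: control sequences predicted as promoters
    """
    total = tp + tn + fp + fn
    # Sensitivity (recall): fraction of real promoters that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of control sequences correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Accuracy: fraction of all sequences classified correctly.
    accuracy = (tp + tn) / total if total else 0.0
    # MCC: correlation between predicted and true labels, in [-1, 1].
    # By convention it is reported as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only (not results from the paper).
example = confusion_metrics(tp=80, tn=90, fp=10, fn=20)
print(example)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the promoter and control sets differ in size.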
