Benchmarking Bacterial Promoter Prediction Tools: Potentialities and Limitations

ABSTRACT The promoter region is a key element required for the production of RNA in bacteria. While new high-throughput technology allows massively parallel mapping of promoter elements, we still rely mainly on bioinformatics tools to predict such elements in bacterial genomes. Although many prediction tools have become popular for identifying bacterial promoters, no systematic comparison of them has been performed. Here, we performed a systematic comparison of several widely used promoter prediction tools (BPROM, bTSSfinder, BacPP, CNNProm, IBBP, Virtual Footprint, iPro70-FMWin, 70ProPred, iPromoter-2L, and MULTiPly) using well-defined sequence data sets and standardized metrics to determine how well those tools performed relative to each other. For this, we used data sets of experimentally validated promoters from Escherichia coli and a control data set composed of randomly generated sequences with similar nucleotide distributions. We compared the performance of the tools using metrics such as specificity, sensitivity, accuracy, and Matthews correlation coefficient (MCC). We show that the widely used BPROM had the worst performance among the compared tools, while four tools (CNNProm, iPro70-FMWin, 70ProPred, and iPromoter-2L) offered high predictive power. Of these tools, iPro70-FMWin exhibited the best results for most of the metrics used. We present here some potentials and limitations of available tools, and we hope that future work can build upon our effort to systematically characterize this useful class of bioinformatics tools.

IMPORTANCE The correct mapping of promoter elements is a crucial step in microbial genomics. Also, when combining new DNA elements into synthetic sequences, predicting the potential generation of new promoter sequences is critical. In recent years, many bioinformatics tools have been created to allow users to predict promoter elements in a sequence or genome of interest. Here, we assess the predictive power of some of the main prediction tools available using well-defined promoter data sets. Using Escherichia coli as a model organism, we demonstrated that while some tools are biased toward AT-rich sequences, others are very efficient in identifying real promoters with low false-negative rates. We hope the potentials and limitations presented here will help the microbiology community to choose promoter prediction tools among the many available alternatives.
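The four metrics named in the abstract (sensitivity, specificity, accuracy, and MCC) all follow from a confusion matrix of predictions on the promoter set and the control set. The sketch below uses their standard textbook definitions and is not the authors' evaluation code. The function name `classification_metrics` and the counts in the usage example are hypothetical, chosen only for illustration and not taken from the study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts.

    tp: real promoters predicted as promoters
    fn: real promoters missed by the tool
    tn: control sequences correctly rejected
    fp: control sequences predicted as promoters
    (Hypothetical helper, assumed for illustration.)
    """
    total = tp + fp + tn + fn
    # Sensitivity: fraction of real promoters that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of control sequences correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Accuracy: fraction of all sequences classified correctly.
    accuracy = (tp + tn) / total if total else 0.0
    # MCC: correlation between predicted and true labels, in [-1, 1].
    # By convention it is set to 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Illustrative counts only (not results from the paper).
example = classification_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

MCC is a useful companion to accuracy here because it uses all four cells of the confusion matrix. A tool that labels every sequence as a promoter reaches perfect sensitivity, but under the zero-denominator convention used in this sketch its MCC is 0.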
