iPro70-FMWin: identifying Sigma70 promoters using multiple windowing and minimal features

In bacterial DNA, promoters are specific nucleotide sequences to which RNA polymerase binds. Sigma70 ($\sigma^{70}$) promoters are among the most important because they are involved in most DNA regulatory functions. In this paper, we identify the most effective sequence-based features for predicting $\sigma^{70}$ promoter sequences in a bacterial genome. Our proposed method uses both short-range and long-range DNA sequence information. Features are extracted from multiple windows of different sizes within the DNA sequences, and a very small number of effective features are then selected from this large pool. We call our prediction method iPro70-FMWin and have made it freely accessible as a web application at http://ipro70.pythonanywhere.com/server. We tested the method on a standard benchmark dataset. In the experiments, iPro70-FMWin achieved an area under the receiver operating characteristic curve of 0.959 and an accuracy of 90.57%, significantly outperforming the state-of-the-art predictors.
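The idea of extracting features from windows of several sizes can be illustrated with a minimal sketch. It is not the iPro70-FMWin feature set, which the abstract does not specify. It assumes non-overlapping windows and normalized k-mer frequencies as the per-window features; the function names, window sizes, and value of `k` are hypothetical choices made for illustration only.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(segment, k):
    """Return normalized counts of every k-mer over ACGT found in `segment`."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(segment) - k + 1):
        kmer = segment[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def multi_window_features(seq, window_sizes=(20, 40), k=2):
    """Concatenate k-mer frequencies from non-overlapping windows of each size.

    Window sizes and k are illustrative assumptions, not values from the paper.
    """
    seq = seq.upper()
    features = []
    for size in window_sizes:
        # Slide in steps of `size`, dropping any incomplete trailing window.
        for start in range(0, len(seq) - size + 1, size):
            features.extend(kmer_frequencies(seq[start:start + size], k))
    return features
```

Calling `multi_window_features` on a sequence yields one flat vector, with small windows capturing short-range composition and large windows capturing long-range composition. In a full pipeline, a feature selection step would then reduce this vector to a small subset before classification.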
