SDBP-Pred: Prediction of single-stranded and double-stranded DNA-binding proteins by extending consensus sequence and K-segmentation strategies into PSSM.

Identification of DNA-binding proteins (DNA-BPs) is an active topic in protein science because of their key role in various biological processes. These processes depend closely on the type of DNA-binding protein involved. DNA-BPs are classified into single-stranded DNA-binding proteins (SSBs) and double-stranded DNA-binding proteins (DSBs). SSBs are mainly involved in DNA recombination, replication, and repair, while DSBs regulate transcription, DNA cleavage, and chromosome packaging. Despite this significance, few methods have been proposed for discriminating SSBs from DSBs, so additional predictors with good performance are needed. In this work, we present a predictor called SDBP-Pred, which uses a novel feature descriptor named consensus sequence-based K-segmentation position-specific scoring matrix (CSKS-PSSM). We encode the local discriminative features concealed in the PSSM via a K-segmentation strategy, and the global features by applying the notion of the consensus sequence. The resulting feature vector is then input to a support vector machine (SVM) with linear, polynomial, and radial basis function (RBF) kernels. Our model with the SVM-RBF kernel achieved higher accuracies than the most recent method on all three tests: jackknife, 10-fold cross-validation, and independent tests. These results indicate that SDBP-Pred outperforms the existing methods in the literature for this task.
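The two ideas behind the descriptor can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact CSKS-PSSM formula, so the sketch assumes that the consensus sequence takes the highest-scoring residue at each PSSM position (summarized here as a 20-value composition) and that K-segmentation splits the PSSM rows into K contiguous blocks and averages each column within a block. The function names and the default `k=3` are hypothetical.

```python
from typing import List

# Standard PSI-BLAST PSSM column order for the 20 amino acids.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

Pssm = List[List[float]]  # one row of 20 scores per residue position


def consensus_sequence(pssm: Pssm) -> str:
    """Assumed definition: the highest-scoring amino acid at each position."""
    return "".join(
        AMINO_ACIDS[max(range(20), key=lambda j: row[j])] for row in pssm
    )


def consensus_composition(pssm: Pssm) -> List[float]:
    """Global features: amino acid frequencies of the consensus sequence."""
    seq = consensus_sequence(pssm)
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def k_segment_features(pssm: Pssm, k: int) -> List[float]:
    """Local features: per-column mean score inside each of k row blocks."""
    n = len(pssm)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the sequence length")
    features: List[float] = []
    for s in range(k):
        # Integer boundaries keep every block non-empty when k <= n.
        segment = pssm[s * n // k:(s + 1) * n // k]
        for j in range(20):
            features.append(sum(row[j] for row in segment) / len(segment))
    return features


def csks_like_features(pssm: Pssm, k: int = 3) -> List[float]:
    """Concatenate global (20) and local (20 * k) features."""
    return consensus_composition(pssm) + k_segment_features(pssm, k)
```

Under these assumptions, a protein of any length maps to a fixed-length vector of 20 + 20K values, which is the property that makes such a descriptor usable as SVM input.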
