Local-DPP: An improved DNA-binding protein prediction method by exploring local evolutionary information

Abstract Increased knowledge of DNA-binding proteins would enhance our understanding of protein functions in cellular biological processes. To handle the explosive growth of protein sequence data, researchers have developed machine learning-based methods that quickly and accurately predict DNA-binding proteins. In recent years, the predictive accuracy of machine learning-based predictors has significantly advanced, but the predictive performance remains unsatisfactory. In this paper, we establish a novel predictor named Local-DPP, which combines the local Pse-PSSM (Pseudo Position-Specific Scoring Matrix) features with the random forest classifier. The proposed features can efficiently capture the local conservation information, together with the sequence-order information, from the evolutionary profiles (PSSMs). We evaluate and compare the Local-DPP predictor with state-of-the-art predictors on two stringent benchmark datasets (one for the jackknife test, the other for an independent test). The proposed Local-DPP significantly improved the accuracy of the existing predictors, from 77.3% to 79.2% and 76.9% to 79.0% in the jackknife and independent tests, respectively. This demonstrates the efficacy and effectiveness of Local-DPP in predicting DNA-binding proteins. The proposed Local-DPP is now freely accessible to the public through the user-friendly webserver http://server.malab.cn/Local-DPP/Index.html .

[1]  Loris Nanni,et al.  An ensemble of reduced alphabets with protein encoding based on grouped weight for predicting DNA-binding proteins , 2008, Amino Acids.

[2]  Xiaolong Wang,et al.  Identification of DNA-binding proteins by incorporating evolutionary information into pseudo amino acid composition via the top-n-gram approach , 2015, Journal of biomolecular structure & dynamics.

[3]  Xiaolong Wang,et al.  Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation , 2015, BMC Systems Biology.

[4]  Xiangxiang Zeng,et al.  nDNA-prot: identification of DNA-binding proteins based on unbalanced classification , 2014, BMC Bioinformatics.

[5]  B. Liu,et al.  DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation , 2015, Scientific Reports.

[6]  Hanlee P. Ji,et al.  Next-generation DNA sequencing , 2008, Nature Biotechnology.

[7]  George C. Runger,et al.  Bias of Importance Measures for Multi-valued Attributes and Solutions , 2011, ICANN.

[8]  Ling Jing,et al.  Predicting DNA- and RNA-binding proteins from sequences with kernel methods. , 2009, Journal of theoretical biology.

[9]  E. Huitema,et al.  DNA-binding protein prediction using plant specific support vector machines: validation and application of a new genome annotation tool , 2015, Nucleic acids research.

[10]  Q. Zou,et al.  Hierarchical Classification of Protein Folds Using a Novel Ensemble Classifier , 2013, PloS one.

[11]  Yanzhi Guo,et al.  Predicting DNA-binding proteins: approached from Chou’s pseudo amino acid composition and other specific sequence features , 2007, Amino Acids.

[12]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[13]  N. Bhardwaj,et al.  Kernel-based machine learning protocol for predicting DNA-binding proteins , 2005, Nucleic acids research.

[14]  Christina S. Leslie,et al.  iDBPs: a web server for the identification of DNA binding proteins , 2010, Bioinform..

[15]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[16]  Yael Mandel-Gutfreund,et al.  Annotating nucleic acid-binding function based on protein structure. , 2003, Journal of molecular biology.

[17]  Pradeep Kumar Naik,et al.  BINARY CLASSIFICATION OF UNCHARACTERIZED PROTEINS INTO DNA BINDING/NON-DNA BINDING PROTEINS FROM SEQUENCE DERIVED FEATURES USING ANN , 2009 .

[18]  Jeffrey Skolnick,et al.  Efficient prediction of nucleic acid binding function from low-resolution protein structures. , 2006, Journal of molecular biology.

[19]  Yu-dong Cai,et al.  Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. , 2003, Biochimica et biophysica acta.

[20]  Bin Liu,et al.  Identification of DNA-binding proteins by auto-cross covariance transformation , 2015, 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

[21]  Hong Yan,et al.  Prediction of DNA-binding protein based on statistical and geometric features and support vector machines , 2011, Proteome Science.

[22]  Xing Gao,et al.  An Improved Protein Structural Classes Prediction Method by Incorporating Both Sequence and Structure Information , 2015, IEEE Transactions on NanoBioscience.

[23]  Xiaolong Wang,et al.  Protein Remote Homology Detection by Combining Chou’s Pseudo Amino Acid Composition and Profile‐Based Protein Representation , 2013, Molecular informatics.

[24]  Xiaolong Wang,et al.  Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection , 2013, Bioinform..

[25]  B. Liu,et al.  iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance-Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition , 2014, PloS one.

[26]  Jeffrey Skolnick,et al.  A Threading-Based Method for the Prediction of DNA-Binding Proteins with Application to the Human Genome , 2009, PLoS Comput. Biol..

[27]  N. Bhardwaj,et al.  Residue‐level prediction of DNA‐binding sites and its application on DNA‐binding protein predictions , 2007, FEBS letters.

[28]  P. N. Suganthan,et al.  DNA-Prot: Identification of DNA Binding Proteins from Protein Sequence Information using Random Forest , 2009, Journal of biomolecular structure & dynamics.

[29]  Ying Gao,et al.  Bioinformatics Applications Note Sequence Analysis Cd-hit Suite: a Web Server for Clustering and Comparing Biological Sequences , 2022 .

[30]  H. Kono,et al.  Protein-DNA recognition patterns and predictions. , 2005, Annual review of biophysics and biomolecular structure.

[31]  Gajendra P. S. Raghava,et al.  Identification of DNA-binding proteins using support vector machines and evolutionary profiles , 2007, BMC Bioinformatics.

[32]  Shinn-Ying Ho,et al.  Design of accurate predictors for DNA-binding sites in proteins using hybrid SVM-PSSM method , 2007, Biosyst..

[33]  Jeffrey Skolnick,et al.  DBD-Hunter: a knowledge-based method for the prediction of DNA–protein interactions , 2008, Nucleic acids research.

[34]  Yaoqi Zhou,et al.  Structure-based prediction of DNA-binding proteins by structural alignment and a volume-fraction corrected DFIRE-based energy function , 2010, Bioinform..

[35]  Rolf Apweiler,et al.  The SWISS-PROT protein sequence data bank and its supplement TrEMBL , 1997, Nucleic Acids Res..

[36]  Yixue Li,et al.  Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines. , 2006, Journal of theoretical biology.

[37]  Loris Nanni,et al.  Combing ontologies and dipeptide composition for predicting DNA-binding proteins , 2007, Amino Acids.

[38]  B. Liu,et al.  An Approach for Identifying Cytokines Based on a Novel Ensemble Classifier , 2013, BioMed research international.

[39]  B. Liu,et al.  PseDNA‐Pro: DNA‐Binding Protein Identification by Combining Chou’s PseAAC and Physicochemical Distance Transformation , 2015, Molecular informatics.

[40]  K. Chou,et al.  iDNA-Prot: Identification of DNA Binding Proteins Using Random Forest with Grey Model , 2011, PloS one.

[41]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[42]  Honglin Li,et al.  An improved sequence based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis , 2012, BMC Bioinformatics.

[43]  Chen Lin,et al.  LibD3C: Ensemble classifiers with a clustering and dynamic selection strategy , 2014, Neurocomputing.

[44]  Akinori Sarai,et al.  Moment-based prediction of DNA-binding proteins. , 2004, Journal of molecular biology.

[45]  R. Langlois,et al.  Boosting the prediction and understanding of DNA-binding domains from sequence , 2010, Nucleic acids research.

[46]  Xing Gao,et al.  Enhanced Protein Fold Prediction Method Through a Novel Feature Extraction Technique , 2015, IEEE Transactions on NanoBioscience.

[47]  Q. Zou,et al.  enDNA-Prot: Identification of DNA-Binding Proteins by Applying Ensemble Learning , 2014, BioMed research international.

[48]  Kuo-Chen Chou,et al.  MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. , 2007, Biochemical and biophysical research communications.

[49]  Chris Sander,et al.  Removing near-neighbour redundancy from large protein sequence collections , 1998, Bioinform..

[50]  Khurshid Ahmad,et al.  Identification of DNA binding proteins using evolutionary profiles position specific scoring matrix , 2016, Neurocomputing.

[51]  Ondrej Kuzelka,et al.  Prediction of DNA-binding propensity of proteins by the ball-histogram method using automatic template search , 2011, BMC Bioinformatics.

[52]  Bo Jiang,et al.  Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes , 2014, PloS one.