PredDBP-Stack: Prediction of DNA-Binding Proteins from HMM Profiles using a Stacked Ensemble Method

DNA-binding proteins (DBPs) play vital roles in all aspects of genetic activity. However, identifying DBPs with wet-lab experimental approaches is often time-consuming and laborious. In this study, we develop a novel computational method, called PredDBP-Stack, to predict DBPs based solely on protein sequences. First, amino acid composition (AAC) and transition probability composition (TPC) extracted from the hidden Markov model (HMM) profile are adopted to represent a protein. Next, we establish a stacked ensemble model to identify DBPs, which involves two stages of learning. In the first stage, four base classifiers are trained on the HMM-based composition features. In the second stage, the prediction probabilities of these base classifiers are used as inputs to a meta-classifier, which performs the final prediction of DBPs. On the PDB1075 benchmark dataset, we conduct a jackknife cross-validation of the proposed PredDBP-Stack predictor and obtain a balanced sensitivity and specificity of 92.47% and 92.36%, respectively. This result outperforms most existing classifiers. Furthermore, our method also achieves superior performance and model robustness on the PDB186 independent dataset. This demonstrates that PredDBP-Stack is an effective classifier for accurately identifying DBPs based on protein sequence information alone.
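
The jackknife evaluation mentioned above can be illustrated with a minimal sketch. In a jackknife (leave-one-out) test, each protein is held out once, a model is trained on all the others, and the held-out protein is predicted; sensitivity and specificity are then computed over all held-out predictions. The code below is not the PredDBP-Stack implementation: the function names, the nearest-centroid stand-in classifier, and the toy feature vectors are all assumptions made for illustration, and any classifier with the same call signature could be substituted.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical interface: train on (X_train, y_train), return a label for x.
FitPredict = Callable[[List[Vector], List[int], Vector], int]


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity); label 1 = DBP, label 0 = non-DBP."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else 0.0  # TP / (TP + FN)
    spec = tn / (tn + fp) if (tn + fp) else 0.0  # TN / (TN + FP)
    return sens, spec


def jackknife_predict(
    X: List[Vector], y: List[int], fit_predict: FitPredict
) -> List[int]:
    """Hold out each sample once, train on the rest, predict the held-out one."""
    preds = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        preds.append(fit_predict(X_train, y_train, X[i]))
    return preds


def nearest_centroid(X_train: List[Vector], y_train: List[int], x: Vector) -> int:
    """Stand-in classifier (an assumption, not the paper's model):
    assign x to the class whose mean feature vector is closest."""
    best_label, best_dist = -1, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, t in zip(X_train, y_train) if t == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Toy, invented feature vectors standing in for HMM-based compositions.
X_toy = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.15],
         [0.90, 0.80], [0.80, 0.90], [0.85, 0.85]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_preds = jackknife_predict(X_toy, y_toy, nearest_centroid)
toy_sens, toy_spec = sensitivity_specificity(y_toy, toy_preds)
print(f"sensitivity={toy_sens:.2f} specificity={toy_spec:.2f}")
```

Because the jackknife retrains the model once per sample, its cost grows linearly with dataset size, which is why it is usually reserved for benchmark sets of modest size such as PDB1075.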
