DP-BINDER: machine learning model for prediction of DNA-binding proteins by fusing evolutionary and physicochemical information

DNA-binding proteins (DBPs) participate in various biological processes, including DNA replication, recombination, and repair. About 6–7% of the genes in the human genome are reported to encode DBPs. DBPs pack DNA into a compact structure known as chromatin, and some of them regulate chromosome packaging and transcription. In the pharmaceutical industry, DBPs are used as key components of antibiotics, steroids, and cancer drugs. These proteins are also involved in biophysical, biological, and biochemical studies of DNA. Because of their crucial role in these biological activities, the identification of DBPs is an active topic in protein science. A series of experimental and computational methods have been proposed; however, some did not achieve the desired results, and others are inadequate in their accuracy and reliability. More intelligent computational predictors are therefore still needed. In this work, we introduce a computational method, DP-BINDER, based on physicochemical and evolutionary information. We extracted local, highly discriminative features from the physicochemical properties of primary protein sequences via normalized Moreau-Broto autocorrelation (NMBAC), and evolutionary information via position-specific scoring matrix-transition probability composition (PSSM-TPC) and pseudo position-specific scoring matrix (PsePSSM), using training and independent datasets. The optimal features were selected from the fused features by support vector machine-recursive feature elimination with correlation bias reduction (SVM-RFE + CBR) and were fed into random forest (RF) and support vector machine (SVM) classifiers. Our method attained 92.46% and 89.58% accuracy with jackknife and ten-fold cross-validation, respectively, on the training dataset, and 81.17% accuracy on the independent dataset for the prediction of DBPs. These results indicate that our method attained a higher success rate than the methods previously reported in the literature. The superiority of DP-BINDER over existing approaches is attributable to several factors, including the extraction of local dominant features via effective feature descriptors, the use of an appropriate feature selection algorithm, and an effective classifier.
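
To make two of the feature descriptors named above concrete, the following is a minimal sketch of NMBAC and PsePSSM in their commonly used formulations. It is not the authors' implementation. Several choices are assumptions made for illustration: the Kyte-Doolittle hydropathy scale stands in for the physicochemical properties (the abstract does not list which ones were used), the function names `nmbac` and `pse_pssm` are hypothetical, and the PSSM is assumed to be already standardized and supplied as a list of 20-value rows. PSSM-TPC, SVM-RFE + CBR, and the classifiers are not shown.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy values, used only as an example property.
# Assumption: the properties used by DP-BINDER are not given in the abstract.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def normalize_property(scale):
    """Rescale a property to zero mean and unit standard deviation over the 20 residues."""
    mu = mean(scale.values())
    sigma = pstdev(scale.values())
    return {aa: (value - mu) / sigma for aa, value in scale.items()}


def nmbac(sequence, scale, max_lag):
    """Normalized Moreau-Broto autocorrelation of one property for lags 1..max_lag."""
    if max_lag >= len(sequence):
        raise ValueError("max_lag must be smaller than the sequence length")
    norm = normalize_property(scale)
    values = [norm[aa] for aa in sequence]
    n = len(values)
    # For each lag, average the product of property values `lag` residues apart.
    return [
        sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


def pse_pssm(pssm, max_lag):
    """PsePSSM: column means followed by lagged mean squared differences per column."""
    n = len(pssm)
    width = len(pssm[0])
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the number of PSSM rows")
    # First block: average score of each column over the whole sequence.
    features = [sum(row[j] for row in pssm) / n for j in range(width)]
    # Remaining blocks: sequence-order information at each lag, column by column.
    for lag in range(1, max_lag + 1):
        for j in range(width):
            total = sum((pssm[i][j] - pssm[i + lag][j]) ** 2 for i in range(n - lag))
            features.append(total / (n - lag))
    return features


if __name__ == "__main__":
    # Toy inputs: a short made-up peptide and a constant 4 x 20 matrix.
    print(len(nmbac("MKVLAAGIVR", HYDROPATHY, 3)))
    print(len(pse_pssm([[1.0] * 20 for _ in range(4)], 2)))
```

Both functions return fixed-length vectors regardless of protein length, which is what allows descriptors of this kind to be concatenated ("fused") and passed to a feature selection step and a classifier.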
