Improved detection of DNA-binding proteins via compression technology on PSSM information

Since the importance of DNA-binding proteins in many biomolecular functions was recognized, a growing number of researchers have sought to identify them. In recent years, as the volume of protein sequence data has soared, machine learning methods have become increasingly attractive because of their speed and accuracy. In this paper, we extract three features from the protein sequence: NMBAC (Normalized Moreau-Broto Autocorrelation), PSSM-DWT (Position-Specific Scoring Matrix with Discrete Wavelet Transform), and PSSM-DCT (Position-Specific Scoring Matrix with Discrete Cosine Transform). We also apply a feature selection algorithm to these feature vectors. The selected features are then used to train an SVM (support vector machine) classifier that predicts DNA-binding proteins. We evaluate our approach on three datasets: PDB1075, PDB594, and PDB186. PDB1075 and PDB594 are used for the jackknife test, and PDB186 is used for the independent test. In the jackknife test, our method achieves the best accuracy, from 79.20% to 86.23% on PDB1075 and from 80.5% to 86.20% on PDB594. In the independent test, the accuracy of our method reaches 76.3%. The independent test results also indicate that our method can be used effectively for DNA-binding protein prediction. The data and source code are available at https://doi.org/10.6084/m9.figshare.5104084.
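As an illustration of how such features can turn proteins of different lengths into fixed-length vectors, the following is a minimal sketch and not the released implementation. It assumes that PSSM-DCT keeps the first few low-frequency DCT-II coefficients of each PSSM column, and that NMBAC averages lagged products of an already normalized per-residue property. The function names, the number of retained coefficients (`keep`), and the maximum lag (`max_lag`) are hypothetical choices made here for illustration only.

```python
import math


def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(n^2))."""
    n = len(signal)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def pssm_dct_features(pssm, keep=4):
    """Compress an L x 20 PSSM into 20 * keep values.

    Assumption: each column is transformed separately and only the
    first `keep` (low-frequency) coefficients are retained, so the
    output length does not depend on L (requires L >= keep).
    """
    n_cols = len(pssm[0])
    features = []
    for j in range(n_cols):
        column = [row[j] for row in pssm]
        features.extend(dct2(column)[:keep])
    return features


def moreau_broto(values, max_lag=3):
    """Moreau-Broto autocorrelation for lags 1..max_lag.

    `values` holds one (assumed already normalized) physicochemical
    property value per residue; each lag averages the products of
    values that are `lag` positions apart.
    """
    n = len(values)
    return [
        sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]
```

Under these assumptions, a PSSM with any number of rows yields a vector of length 20 × `keep`, which can be concatenated with the autocorrelation values before feature selection and SVM training.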
