newDNA-Prot: Prediction of DNA-binding proteins by employing support vector machine and a comprehensive sequence representation

Identification of DNA-binding proteins is essential for studying cellular activities because DNA-binding proteins play a pivotal role in gene regulation. In this study, we propose newDNA-Prot, a DNA-binding protein predictor that employs a support vector machine classifier and a comprehensive feature representation. The sequence features are categorized into 6 groups: primary sequence based, evolutionary profile based, predicted secondary structure based, predicted relative solvent accessibility based, physicochemical property based and biological function based features. The mRMR, wrapper and two-stage feature selection methods are employed to remove irrelevant features and reduce redundant ones. Experiments demonstrate that the two-stage method performs better than the mRMR and wrapper methods. We also perform a statistical analysis of the selected features; the results show that more than 95% of the selected features are statistically significant and that they cover all 6 feature groups. The newDNA-Prot method is compared with several state-of-the-art algorithms, including iDNA-Prot, DNAbinder and DNA-Prot, and it outperforms all three. More specifically, newDNA-Prot improves on the runner-up method, DNA-Prot, by around 10% on several evaluation measures. The proposed newDNA-Prot method is available at http://sourceforge.net/projects/newdnaprot/
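
The abstract does not spell out how the two-stage feature selection is built. The following is a minimal sketch that assumes the first stage is an mRMR-style filter and the second stage is a wrapper search restricted to the filter's output. Several parts are stand-ins chosen only to keep the example dependency-free and are not taken from newDNA-Prot: absolute Pearson correlation replaces mutual information, a nearest-centroid classifier replaces the support vector machine, and the toy data and all function names are hypothetical.

```python
import random
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant column."""
    mx, my, sx, sy = mean(x), mean(y), pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)]) / (sx * sy)

def mrmr_rank(X, y, k):
    """Stage 1 (filter): greedily pick k features by relevance minus mean redundancy.
    Assumption: |correlation| is used here in place of mutual information."""
    cols = [[row[j] for row in X] for j in range(len(X[0]))]
    relevance = [abs(pearson(c, y)) for c in cols]
    selected, remaining = [], set(range(len(cols)))
    while remaining and len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = mean([abs(pearson(cols[j], cols[s])) for s in selected])
            return relevance[j] - redundancy
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

def cv_accuracy(X, y, feats, folds=5):
    """Cross-validated accuracy of a nearest-centroid classifier (stand-in for an SVM)."""
    n, correct = len(X), 0
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        test = [i for i in range(n) if i % folds == f]
        centroids = {}
        for label in sorted(set(y[i] for i in train)):
            rows = [X[i] for i in train if y[i] == label]
            centroids[label] = [mean([r[j] for r in rows]) for j in feats]
        for i in test:
            pred = min(centroids, key=lambda c: sum(
                (X[i][j] - m) ** 2 for j, m in zip(feats, centroids[c])))
            correct += int(pred == y[i])
    return correct / n

def wrapper_forward(X, y, candidates):
    """Stage 2 (wrapper): forward selection, adding a feature only if accuracy improves."""
    chosen, best_acc = [], 0.0
    improved = True
    while improved:
        improved, best_j = False, None
        for j in candidates:
            if j in chosen:
                continue
            acc = cv_accuracy(X, y, chosen + [j])
            if acc > best_acc:
                best_acc, best_j, improved = acc, j, True
        if improved:
            chosen.append(best_j)
    return chosen, best_acc

# Hypothetical toy data: feature 0 is informative, feature 1 is a near copy of it
# (redundant), and features 2 and 3 are pure noise.
rng = random.Random(0)
y = [i % 2 for i in range(100)]
X = []
for label in y:
    f0 = 2 * label + rng.gauss(0, 0.5)
    X.append([f0, f0 + rng.gauss(0, 0.1), rng.gauss(0, 1), rng.gauss(0, 1)])

stage1 = mrmr_rank(X, y, k=2)              # filter: shrink the candidate pool
chosen, acc = wrapper_forward(X, y, stage1)  # wrapper: search only within that pool
print("filter output:", stage1, "final subset:", chosen, "CV accuracy:", round(acc, 3))
```

Under these assumptions, the point of chaining the stages is cost: the filter is cheap and discards most features, so the expensive wrapper, which retrains the classifier for every candidate subset, runs over a much smaller pool.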
