nDNA-prot: identification of DNA-binding proteins based on unbalanced classification

Background

DNA-binding proteins are vital for the study of cellular processes. In recent genome engineering studies, the identification of proteins with certain functions has become increasingly important and needs to be performed rapidly and efficiently. In previous years, several approaches have been developed to improve the identification of DNA-binding proteins. However, the currently available resources are insufficient to identify these proteins accurately. As a result, previous research has been limited by unbalanced accuracy rates and the low identification success of current methods.

Results

In this paper, we explored the practicality of modelling DNA-binding identification with an ensemble classifier and designed a new predictor, nDNA-Prot. The framework comprises two stages: a 188-dimensional feature extraction method that captures protein structure information, and an ensemble classifier designated imDC. Features were selected using the minimum Redundancy Maximum Relevance (mRMR) criterion. Experiments on different datasets showed that our method identifies DNA-binding proteins more successfully than traditional methods. In cross-validation, it achieved an accuracy of 95.80% and an Area Under the Curve (AUC) of 0.986. On a test dataset, our method achieved 86% accuracy, versus 76% for iDNA-Prot and 68% for DNA-Prot.

Conclusions

Our method can help to identify DNA-binding proteins accurately, and the web server is accessible at http://datamining.xmu.edu.cn/~songli/nDNA. In addition, we predicted possible DNA-binding protein sequences among all of the sequences in the UniProtKB/Swiss-Prot database.
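The abstract reports both accuracy and AUC, and the distinction matters when classes are unbalanced. The following is a minimal Python sketch of the two metrics on toy data; it is not the authors' evaluation code, and the function names (`accuracy`, `auc`), labels, and scores are all hypothetical values chosen for illustration. It shows how a classifier that never predicts "DNA-binding" can still reach high accuracy on a skewed dataset while its AUC stays at chance level.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive example scores higher than a random negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy, made-up data: 2 DNA-binding proteins (label 1) among 10 sequences.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A trivial classifier that always predicts the majority class.
majority_predictions = [0] * len(labels)
majority_scores = [0.0] * len(labels)

# A hypothetical scoring classifier that ranks most binders above non-binders.
ranked_scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05]

print("majority accuracy:", accuracy(labels, majority_predictions))
print("majority AUC:", auc(labels, majority_scores))
print("ranked AUC:", auc(labels, ranked_scores))
```

Here the majority-class predictor is correct on 8 of 10 sequences yet has an AUC of 0.5, which is why a threshold-independent measure such as AUC is usually reported alongside accuracy for unbalanced problems.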
