Prediction of DNA-binding protein using random forest and elastic net

Recognizing DNA-binding proteins is an important task because these proteins play vital roles in many biological processes. To reveal the connection between the intrinsic information of a protein and DNA-protein binding, a 314-dimensional feature vector is used as input for DNA-binding protein prediction. All 314 values are numeric features identified as important and encoded from multiple properties of the protein. A large number of numerical experiments with 5-fold cross-validation are performed to find the optimal parameters and to construct models with random forest and elastic net. These experiments indicate that the box-counting dimension, the information entropies of the chaos game representation, and the information entropies of the dipeptide composition are the most important features. On the test dataset, the random forest and elastic net models perform slightly better than DNA-Prot: the Matthews correlation coefficient (MCC) is 0.7374 and 0.7591, and the accuracy (ACC) is 0.8750 and 0.8698, respectively. On independent dataset1 and independent dataset2, the models obtain slightly lower MCC and ACC values than DNA-Prot [1].
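The two evaluation measures reported above, MCC and ACC, follow directly from the confusion matrix of a binary classifier. The following is a minimal sketch in plain Python using the standard definitions of both measures. The function names and the toy labels are illustrative assumptions; they are not taken from the paper and do not reproduce its data or results.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = DNA-binding, 0 = non-binding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def matthews_cc(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels (hypothetical, for illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
mcc = matthews_cc(tp, tn, fp, fn)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} ACC={acc:.4f} MCC={mcc:.4f}")
```

In a 5-fold cross-validation setting, these measures would typically be computed on each held-out fold and then averaged, or computed once on the pooled out-of-fold predictions. Which of the two the authors used is not stated in the abstract.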

[1]  H. J. Jeffrey  Chaos game representation of gene structure, 1990, Nucleic Acids Research.

[2]  Yong Wang, et al.  Support vector machine prediction of enzyme function with conjoint triad feature and hierarchical context, 2011, BMC Systems Biology.

[3]  K. Chou, et al.  iDNA-Prot: Identification of DNA Binding Proteins Using Random Forest with Grey Model, 2011, PLoS ONE.

[4]  H. J. Jeffrey  Chaos game representation of gene structure, 1990, Nucleic Acids Research.

[5]  Yael Mandel-Gutfreund, et al.  Annotating nucleic acid-binding function based on protein structure, 2003, Journal of Molecular Biology.

[6]  C. Sparrow  The Fractal Geometry of Nature, 1984.

[7]  Xiao-hui Niu, et al.  Predicting DNA binding proteins using support vector machine with hybrid fractal features, 2014, Journal of Theoretical Biology.

[8]  N. Bhardwaj, et al.  Kernel-based machine learning protocol for predicting DNA-binding proteins, 2005, Nucleic Acids Research.

[9]  Niu Xiaohui, et al.  Using the concept of Chou's pseudo amino acid composition to predict protein solubility: an approach with entropies in information theory, 2013, Journal of Theoretical Biology.

[10]  P. N. Suganthan, et al.  DNA-Prot: Identification of DNA Binding Proteins from Protein Sequence Information using Random Forest, 2009, Journal of Biomolecular Structure & Dynamics.

[11]  S. Basu, et al.  Chaos game representation of proteins, 1997, Journal of Molecular Graphics & Modelling.

[12]  Ondrej Kuzelka, et al.  Prediction of DNA-binding propensity of proteins by the ball-histogram method using automatic template search, 2011, BMC Bioinformatics.

[13]  Honglin Li, et al.  An improved sequence based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis, 2012, BMC Bioinformatics.

[14]  Liangjiang Wang, et al.  BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences, 2006, Nucleic Acids Research.

[15]  R. Langlois, et al.  Boosting the prediction and understanding of DNA-binding domains from sequence, 2010, Nucleic Acids Research.

[16]  Bo Jiang, et al.  Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes, 2014, PLoS ONE.

[17]  Gajendra P. S. Raghava, et al.  Identification of DNA-binding proteins using support vector machines and evolutionary profiles, 2007, BMC Bioinformatics.

[18]  K. Chou  Some remarks on protein attribute prediction and pseudo amino acid composition, 2010, Journal of Theoretical Biology.

[19]  Yaoqi Zhou, et al.  Structure-based prediction of DNA-binding proteins by structural alignment and a volume-fraction corrected DFIRE-based energy function, 2010, Bioinform.

[20]  Feng Shi, et al.  Hilbert Huang Transform for Predicting Apoptosis Proteins Types, 2007, 2007 1st International Conference on Bioinformatics and Biomedical Engineering.