SVM-Based Approach for Predicting DNA-Binding Residues in Proteins from Amino Acid Sequences

Protein-DNA interactions are vital to a wide range of biological processes, such as gene regulation and DNA replication and repair. We predict DNA-binding residues in proteins from amino acid sequences using a support vector machine (SVM) with a novel hybrid feature that combines evolutionary information from the amino acid sequences with four physical–chemical properties of amino acids: side chain pKa value, hydrophobicity index, molecular mass, and lone electron pairs. The classifier achieves 79.12% total accuracy, with 74.19% sensitivity and 79.20% specificity. An alternative classifier based on random forest (RF) is also constructed. Further analysis indicates that the hybrid feature contributes clearly to the prediction performance, and that the evolutionary information contributes most to the improvement.
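For readers unfamiliar with the three reported measures, the following is a minimal sketch of how total accuracy, sensitivity, and specificity are conventionally computed from per-residue labels. It uses the standard confusion-matrix definitions and is not taken from the authors' code; the function name `binding_site_metrics` and the label encoding (1 for a binding residue, 0 for a non-binding residue) are assumptions made for illustration.

```python
def binding_site_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity for binary labels.

    Assumed encoding (illustrative): 1 = DNA-binding residue, 0 = non-binding.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    return {
        # Fraction of all residues classified correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Fraction of true binding residues that were recovered.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Fraction of true non-binding residues that were rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Toy labels for six residues (made-up values, not data from the paper).
toy_true = [1, 1, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 1]
toy_metrics = binding_site_metrics(toy_true, toy_pred)
```

Sensitivity and specificity are reported separately because total accuracy alone can hide poor recovery of one class when the two classes differ in size.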
