A novel sequence-based method of predicting protein DNA-binding residues, using a machine learning approach

Protein-DNA interactions play an essential role in transcriptional regulation, DNA repair, and many other vital biological processes. The mechanism of protein-DNA binding, however, remains unclear. Studying many diseases requires a better understanding of the amino acid motifs that recognize DNA. Because identifying these motifs experimentally is expensive and time-consuming, computational prediction is needed. Some in silico methods have been developed, but they still have considerable limitations. In this study, we used a machine learning approach to develop a new sequence-based method for predicting DNA-binding residues in proteins. As features, we used properties of the micro-environment of each amino acid, taken from the AAIndex, together with conservation scores. In cross-validation testing, the method achieved an overall accuracy of 94.89%. Our results indicate that the amino acid micro-environment is important for DNA binding and that it can be used to identify protein-DNA binding sites.
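To make the idea of a residue "micro-environment" concrete, the sketch below builds one feature vector per residue from a physicochemical property of the residue and its sequence neighbours. It is an illustration under stated assumptions, not the method of this study: the abstract does not say which AAIndex properties, window width, padding scheme, or classifier were used. The Kyte-Doolittle hydropathy scale (AAindex entry KYTJ820101) stands in as an example property, and `window_features`, `half_width`, and `pad_value` are hypothetical names.

```python
# Kyte-Doolittle hydropathy values (AAindex entry KYTJ820101).
# Assumption: chosen only as an example property; the study's actual
# property set is not given in the abstract.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_features(sequence, half_width=3, pad_value=0.0):
    """Return one feature vector per residue.

    Each vector holds the property value of the residue and of its
    `half_width` neighbours on either side. Positions beyond the sequence
    ends, and unknown residue letters, are filled with `pad_value`.
    (Hypothetical helper; window width and padding are assumptions.)
    """
    values = [HYDROPATHY.get(aa, pad_value) for aa in sequence.upper()]
    padding = [pad_value] * half_width
    padded = padding + values + padding
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(values))]


# Example: a window of one neighbour on each side for a short peptide.
features = window_features("MKRAL", half_width=1)
for residue, vector in zip("MKRAL", features):
    print(residue, vector)
```

In a fuller pipeline, per-residue conservation scores would be appended to each vector before training a classifier; how the study combined the two feature types is not described in the abstract.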
