Novel approach for selecting the best predictor for identifying the binding sites in DNA binding proteins

Protein–DNA complexes play vital roles in many cellular processes through the interactions of amino acids with DNA. Several computational methods have been developed for predicting the interacting residues in DNA-binding proteins using sequence and/or structural information. These methods show different levels of accuracy, which may depend on the choice of data sets used in training, the feature sets selected for developing a predictive model, the ability of the models to capture information useful for prediction or a combination of these factors. In many cases, different methods are likely to produce similar results, whereas in others, the predictors may return contradictory predictions. In this situation, a priori estimates of prediction performance applicable to the system being investigated would help biologists choose the best method for designing their experiments. In this work, we have constructed unbiased, stringent and diverse data sets for DNA-binding proteins based on various biologically relevant considerations: (i) seven structural classes, (ii) 86 folds, (iii) 106 superfamilies, (iv) 194 families, (v) 15 binding motifs, (vi) single/double-stranded DNA, (vii) DNA conformation (A, B, Z, etc.), (viii) three functions and (ix) disordered regions. These data sets were culled to be non-redundant at sequence identity cutoffs of 25% and 40% and used to evaluate the performance of 11 different methods for which online services or standalone programs are available. We observed that the best performing methods for each of the data sets showed significant biases toward the data sets selected for their benchmark. Our analysis revealed important data set features, which could be used to estimate these context-specific biases and hence suggest the best method to be used for a given problem. We have developed a web server, which considers these features on demand and displays the best method that the investigator should use.
The web server is freely available at http://www.biotech.iitm.ac.in/DNA-protein/. Further, we have grouped the methods based on their complexity and analyzed their performance. The information gained in this work could be effectively used to select the best method for designing experiments.
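
The idea of choosing a predictor per data set category can be illustrated with a minimal Python sketch. It is not the procedure used in this work or by the web server: the abstract does not state which performance measure was used, so the Matthews correlation coefficient (MCC) is an assumption here, and the function names `mcc` and `best_method_per_category`, as well as the dictionary layout of the inputs, are hypothetical.

```python
from math import sqrt


def mcc(true_labels, predicted_labels):
    """Matthews correlation coefficient for binary per-residue labels.

    Labels are 1 (binding) or 0 (non-binding). MCC is an assumed choice
    of metric; the source does not name the measure it used.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth and pred:
            tp += 1
        elif pred:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when the coefficient is undefined.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_method_per_category(true_labels, predictions):
    """Pick the highest-scoring method for each data set category.

    true_labels: {category: [0/1, ...]}
    predictions: {method: {category: [0/1, ...]}}
    Both layouts are hypothetical and chosen for illustration only.
    """
    best = {}
    for category, labels in true_labels.items():
        scores = {
            method: mcc(labels, per_category[category])
            for method, per_category in predictions.items()
            if category in per_category
        }
        if scores:
            best[category] = max(scores, key=scores.get)
    return best
```

Given per-residue labels grouped by category (for example structural class or DNA strand type) and the corresponding outputs of several predictors, `best_method_per_category` returns one recommended method name per category, which mirrors the kind of context-specific lookup described above.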
