Predicting Target DNA Sequences of DNA-Binding Proteins Based on Unbound Structures

DNA-binding proteins such as transcription factors use DNA-binding domains (DBDs) to bind specific sequences in the genome and thereby initiate many important biological functions. Accurate prediction of these target sequences, often represented by position weight matrices (PWMs), is an important step toward understanding many biological processes. Recent studies have shown that knowledge-based potential functions can be applied to protein-DNA co-crystallized structures to generate PWMs that are largely consistent with experimental data. However, this success has not been extended to DNA-binding proteins that lack co-crystallized structures. This study investigates whether the DNA sequences bound by a DNA-binding protein can be predicted from the protein's unbound structure (the structure of its unbound state). Given an unbound query protein and a template complex, the proposed method first uses structure alignment to generate synthetic protein-DNA complexes for the query protein. Once a complex is available, an atomic-level knowledge-based potential function is used to predict PWMs characterizing the sequences to which the query protein can bind. The method is evaluated on seven DNA-binding proteins that have structures in both DNA-bound and unbound forms for prediction, as well as annotated PWMs for validation. Because this work is the first attempt to predict target sequences of DNA-binding proteins from their unbound structures, three types of structural variation that presumably influence prediction accuracy were examined and discussed. The analyses indicate that the conformational change of a protein upon binding DNA is the key factor.
This study sheds light on the challenge of predicting the target DNA sequences of a protein lacking co-crystallized structures, and it encourages further work on structure alignment-based approaches, in addition to docking- and homology modeling-based approaches, for generating synthetic complexes.
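
The step from a potential function to a PWM can be illustrated with a minimal sketch. The abstract does not state how energies are converted into base frequencies, so the Boltzmann weighting used here is an assumption, and the function name `energies_to_pwm`, the parameter `kT`, and the toy energy values are all hypothetical. The sketch supposes that each candidate base at each position of the synthetic complex has already been scored, and turns those per-position scores into PWM columns.

```python
import math

BASES = "ACGT"

def energies_to_pwm(energies, kT=1.0):
    """Convert per-position base energies into a PWM.

    energies: list of dicts, one per DNA position, mapping each base
              to a (hypothetical) interaction energy; lower is better.
    kT:       assumed temperature-like scaling factor, not from the source.
    """
    pwm = []
    for column in energies:
        # Shift by the minimum energy for numerical stability.
        e_min = min(column[b] for b in BASES)
        weights = {b: math.exp(-(column[b] - e_min) / kT) for b in BASES}
        total = sum(weights.values())
        # Normalize so that each column sums to 1.
        pwm.append({b: weights[b] / total for b in BASES})
    return pwm

# Toy energies (illustrative only): position 0 favors A, position 1 is neutral.
toy_energies = [
    {"A": -2.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 0.0},
]
pwm = energies_to_pwm(toy_energies)
for position, column in enumerate(pwm):
    print(position, {b: round(p, 3) for b, p in column.items()})
```

In this toy case the neutral position yields a uniform column, while the position with a favorable energy for A yields a column dominated by A. A smaller `kT` would sharpen the preference and a larger one would flatten it.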
