Direct inference of protein–DNA interactions using compressed sensing methods

Compressed sensing has revolutionized signal acquisition by enabling complex signals to be measured with remarkable fidelity using a small number of so-called incoherent sensors. We show that molecular interactions, such as protein–DNA interactions, can be analyzed in a directly analogous manner, with similarly remarkable results. Specifically, mesoscopic molecular interactions act as incoherent sensors that measure the energies of microscopic interactions between atoms. We combine concepts from compressed sensing and statistical mechanics to determine the interatomic interaction energies of a molecular system exclusively from experimental measurements, yielding a “de novo” energy potential. In contrast, conventional methods for estimating energy potentials rely on theoretical models premised on a priori assumptions and extensive domain knowledge. We determine the de novo energy potential for pairwise interactions between protein and DNA atoms from (i) experimental measurements of the binding affinity of protein–DNA complexes and (ii) crystal structures of the complexes. We show that the de novo energy potential predicts the binding specificity of proteins to DNA with approximately 90% accuracy, compared with approximately 60% for the best-performing alternative computational methods applied to this fundamental problem. The de novo potential method extends directly to other biomolecular interaction domains (enzyme and signaling-molecule interactions) and to other classes of molecular interactions.
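The sensing analogy can be illustrated with a small sketch. It is not the authors' procedure; it only shows the general shape of a sparse-recovery problem of this kind under assumptions that the abstract does not state. The assumptions are: the binding free energy of each complex is modeled as a linear sum of atom-pair contact counts (taken from a structure) times unknown pair energies; dissociation constants are converted to free energies with the standard relation ΔG = RT ln K_d; and the sparse energies are estimated with an L1-penalized least-squares fit solved by iterative soft thresholding. All names (`free_energy_from_kd`, `lasso_ista`, `contacts`) are hypothetical, and the data are synthetic.

```python
import numpy as np

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def free_energy_from_kd(kd_molar, temperature=298.15):
    """Standard binding free energy (kcal/mol) from a dissociation constant."""
    return R_KCAL * temperature * np.log(kd_molar)

def soft_threshold(x, t):
    """Proximal operator of the L1 norm: shrink each entry toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def objective(A, y, e, lam):
    """Lasso objective: 0.5 * ||A e - y||^2 + lam * ||e||_1."""
    return 0.5 * np.sum((A @ e - y) ** 2) + lam * np.sum(np.abs(e))

def lasso_ista(A, y, lam, n_iter=5000):
    """Minimize the lasso objective by iterative soft thresholding (ISTA)."""
    # Step size 1/L, where L is the Lipschitz constant of the gradient.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    e = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ e - y)
        e = soft_threshold(e - step * grad, step * lam)
    return e

# Synthetic stand-in data (assumption): 60 complexes, 200 atom-pair types,
# of which only 8 carry a nonzero interaction energy.
rng = np.random.default_rng(0)
n_complexes, n_pair_types, n_nonzero = 60, 200, 8

# Each row plays the role of one "sensor": contact counts for one complex.
contacts = rng.poisson(1.0, size=(n_complexes, n_pair_types)).astype(float)

e_true = np.zeros(n_pair_types)
support = rng.choice(n_pair_types, size=n_nonzero, replace=False)
e_true[support] = rng.normal(0.0, 1.0, size=n_nonzero)

# Simulated free-energy measurements with a little noise.
y = contacts @ e_true + 0.01 * rng.normal(size=n_complexes)

lam = 0.1
e_hat = lasso_ista(contacts, y, lam)
print("nonzero estimated energies:", int(np.sum(np.abs(e_hat) > 1e-6)))
```

There are fewer measurements (60) than unknowns (200), so the system is underdetermined, and the L1 penalty is what selects a sparse solution. How well the true energies are recovered depends on the incoherence of the contact matrix, which this toy example does not attempt to verify.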
