Modeling the specificity of protein-DNA interactions

The specificity of protein-DNA interactions is most commonly modeled using position weight matrices (PWMs). First introduced in 1982, PWMs have been adapted to many new types of data, and many different approaches have been developed to determine their parameters. New high-throughput technologies rapidly provide large amounts of data and offer an unprecedented opportunity to accurately determine the specificities of many transcription factors (TFs). Taking full advantage of the new data, however, requires advanced algorithms that take into account the biophysical processes involved in generating the data. The new large datasets can also help determine when the PWM model is inadequate and must be extended to provide accurate predictions of binding sites. This article provides a general mathematical description of a PWM and how it is used to score potential binding sites. It then gives a brief history of the approaches that have been developed and the types of data that are used, with an emphasis on algorithms that we have developed for analyzing high-throughput datasets from several new technologies. It also describes extensions that can be added when the simple PWM model is inadequate, along with further enhancements that may be necessary. Finally, it briefly describes some applications of PWMs in the discovery and modeling of in vivo regulatory networks.
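To make the idea of scoring a potential binding site with a PWM concrete, the following is a minimal sketch in Python. It assumes the common log-odds formulation (observed base frequency at each position relative to a background frequency, with a pseudocount), which may differ from the parameterization used in the article. The function names `build_pwm`, `score_site`, and `scan`, the uniform background, and the example sites are all illustrative assumptions, not taken from the article.

```python
import math

BASES = "ACGT"


def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a log2-odds PWM from aligned, equal-length sites.

    Returns a list with one dict per position, mapping base -> weight.
    The uniform background and the pseudocount are assumptions of this sketch.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")

    pwm = []
    for i in range(length):
        # Start every count at the pseudocount so no frequency is zero.
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append(
            {b: math.log2((counts[b] / total) / background[b]) for b in BASES}
        )
    return pwm


def score_site(pwm, site):
    """Score one candidate site as the sum of per-position weights.

    The sum reflects the additivity assumption of the simple PWM model:
    each position contributes independently of the others.
    """
    if len(site) != len(pwm):
        raise ValueError("site length must match PWM length")
    return sum(column[base] for column, base in zip(pwm, site))


def scan(pwm, sequence):
    """Score every window of PWM length along a sequence (one strand only)."""
    width = len(pwm)
    return [
        (start, score_site(pwm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


# Hypothetical aligned sites, loosely resembling a -10 promoter element.
example_sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
example_pwm = build_pwm(example_sites)
best_start, best_score = max(scan(example_pwm, "GGTATAATGG"), key=lambda x: x[1])
```

Because the score is a plain sum over positions, this sketch cannot capture dependencies between positions; handling those is the kind of extension the abstract refers to when the simple PWM model is inadequate.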
