Genome‐wide Inference of Transcription Factor–DNA Binding Specificity in Cell Regeneration Using a Combination Strategy

Cell growth, development, and the regeneration of tissues and organs are associated with a large number of gene regulation events, which are mediated in part by transcription factors (TFs) binding to cis‐regulatory elements in the genome. Predicting the binding affinity and inferring the binding specificity of TF–DNA interactions at the genomic level would fundamentally advance our understanding of the molecular mechanism and biological implications of sequence‐specific TF–DNA recognition. In this study, we report the development of a combination method to characterize the interaction behavior of an 11‐mer oligonucleotide segment and its mutations with the Gcn4p protein, a homodimeric, basic leucine zipper TF, and to predict the binding affinity and specificity of potential Gcn4p binders on a genome‐wide scale. In this procedure, a position‐mutated energy matrix is created based on molecular modeling analysis of native and mutated Gcn4p–DNA complex structures to describe the position‐independent interaction energy profile of Gcn4p with different nucleotide types at each position of the oligonucleotide. The energy terms extracted from the matrix and their interaction terms are then correlated with experimentally measured affinities of 19 268 distinct oligonucleotides using statistical modeling methodology. Subsequently, the best of the resulting regression models is applied to screen for potential high‐affinity Gcn4p binders in the complete genome.
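As a rough illustration of the screening step, the sketch below looks up one energy term per position of an 11‐nt window from a position‐by‐nucleotide matrix and ranks all windows of a longer sequence by a scoring function. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `energy_terms`, `scan`, and `score_fn` are hypothetical, the matrix layout (11 rows, columns ordered A, C, G, T) is assumed, and `score_fn` merely stands in for whatever fitted regression model is used.

```python
import numpy as np

BASES = "ACGT"   # assumed column order of the energy matrix
SITE_LEN = 11    # length of the oligonucleotide segment

def energy_terms(site, matrix):
    """Look up one energy term per position from a (SITE_LEN x 4) matrix."""
    if len(site) != SITE_LEN:
        raise ValueError("site must be exactly 11 nucleotides long")
    return np.array([matrix[i, BASES.index(b)] for i, b in enumerate(site)])

def scan(sequence, matrix, score_fn, top=5):
    """Score every 11-nt window and return the `top` highest-scoring ones.

    `score_fn` is a hypothetical callable mapping the 11 energy terms of a
    window to a predicted affinity (for example, a fitted regression model).
    """
    hits = []
    for start in range(len(sequence) - SITE_LEN + 1):
        site = sequence[start:start + SITE_LEN]
        if set(site) <= set(BASES):  # skip windows with ambiguous bases
            score = float(score_fn(energy_terms(site, matrix)))
            hits.append((score, start, site))
    return sorted(hits, reverse=True)[:top]
```

In practice the matrix entries would come from the modeling step and the sequence from a genome file; both are left to the caller here.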
The main findings of this study are as follows: (i) The 11 positions of the oligonucleotides are highly interactive and non‐additive in their contribution to Gcn4p–DNA binding affinity; (ii) Indirect conformational effects of nucleotide mutations, together with the associated subtle changes in interfacial atomic contacts, rather than direct nonbonded interactions, are primarily responsible for sequence‐specific recognition; (iii) The intrinsic synergistic effects among the sequence positions of the oligonucleotides determine Gcn4p–DNA binding affinity and specificity; (iv) Linear regression models in conjunction with variable selection seem to perform fairly well in capturing the internal dependencies hidden in the Gcn4p–DNA system, although ignoring nonlinear factors may lead the models to systematically underestimate high‐affinity samples and overestimate low‐affinity samples.
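To make finding (iv) concrete, the sketch below shows one simple way to combine linear regression with variable selection over single‐position terms and their pairwise products. It is an illustration under assumptions, not the procedure used in the study: greedy forward selection by residual sum of squares is swapped in as a generic variable‐selection scheme, and the function names (`with_pairwise_terms`, `fit_rss`, `forward_select`) are hypothetical.

```python
import itertools
import numpy as np

def with_pairwise_terms(X):
    """Append all pairwise products x_i * x_j (i < j) to the single-position terms."""
    pairs = list(itertools.combinations(range(X.shape[1]), 2))
    products = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
    return np.hstack([X, products]), pairs

def fit_rss(X, y, cols):
    """Ordinary least squares with intercept on the chosen columns.

    Returns the coefficient vector and the residual sum of squares (RSS).
    """
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return coef, float(resid @ resid)

def forward_select(X, y, max_terms):
    """Greedily add the column that most reduces the RSS (assumed selection rule)."""
    chosen = []
    for _ in range(max_terms):
        remaining = [c for c in range(X.shape[1]) if c not in chosen]
        best = min(remaining, key=lambda c: fit_rss(X, y, chosen + [c])[1])
        chosen.append(best)
    return chosen
```

With 11 positions this expansion yields 11 single terms plus 55 pairwise products, so some form of selection or validation is needed to keep the model from overfitting. Because the model stays linear in the selected terms, it cannot represent saturation at the extremes of the affinity range, which is consistent with the systematic under‐ and overestimation noted in (iv).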
