DNA sequence+shape kernel enables alignment-free modeling of transcription factor binding

Motivation: Transcription factors (TFs) bind to specific DNA sequence motifs. Several lines of evidence suggest that TF‐DNA binding is mediated in part by properties of the local DNA shape: the width of the minor groove, the relative orientations of adjacent base pairs, etc. Several methods have been developed to jointly account for DNA sequence and shape properties in predicting TF binding affinity. However, a limitation of these methods is that they typically require a training set of aligned TF binding sites.

Results: We describe a sequence + shape kernel that leverages DNA sequence and shape information to better understand protein‐DNA binding preference and affinity. This kernel extends an existing class of k‐mer based sequence kernels, based on the recently described di‐mismatch kernel. Using three in vitro benchmark datasets, derived from universal protein binding microarrays (uPBMs), genomic context PBMs (gcPBMs) and SELEX‐seq data, we demonstrate that incorporating DNA shape information improves our ability to predict protein‐DNA binding affinity. In particular, we observe that (i) the k‐spectrum + shape model performs better than the classical k‐spectrum kernel, particularly for small k values; (ii) the di‐mismatch kernel performs better than the k‐mer kernel for larger k; and (iii) the di‐mismatch + shape kernel performs better than the di‐mismatch kernel for intermediate k values.

Availability and implementation: The software is available at https://bitbucket.org/wenxiu/sequence-shape.git.

Contact: rohs@usc.edu or william-noble@uw.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
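For readers unfamiliar with the classical k‐spectrum kernel named in the Results, the following is a minimal sketch of the standard definition: each sequence is mapped to its vector of k‐mer counts, and the kernel value is the inner product of two such vectors. No alignment of the input sequences is needed. The function names `spectrum_features` and `spectrum_kernel` are hypothetical and are not taken from the authors' software; the shape and di‐mismatch extensions are not shown, because the abstract does not specify them.

```python
from collections import Counter


def spectrum_features(seq: str, k: int) -> Counter:
    """Count every contiguous k-mer in seq (the k-spectrum feature map)."""
    # A sequence shorter than k yields an empty range and so an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x: str, y: str, k: int) -> int:
    """Inner product of the k-mer count vectors of x and y."""
    fx = spectrum_features(x, k)
    fy = spectrum_features(y, k)
    # Counter returns 0 for k-mers absent from y, so only shared k-mers contribute.
    return sum(count * fy[kmer] for kmer, count in fx.items())


if __name__ == "__main__":
    # Illustrative toy sequences, not data from the benchmark sets.
    print(spectrum_kernel("ACGT", "ACGA", 2))
```

In this toy call the two sequences share the 2‐mers AC and CG once each, so the kernel counts two matches. A kernel of this form can be plugged into any kernel‐based learner; which learner the authors use is not stated in the abstract.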
