Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data

BackgroundStatistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, used to be the standard model for this task for more than three decades but its simple assumptions are increasingly put into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting, and while model classes that dynamically adapt the model complexity to data have been developed, effective model selection is to date only possible for fully observable data, but not, e.g., within de novo motif discovery.ResultsTo address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter-tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides, but that second-order models appear to be the significantly better choice.ConclusionsThe traditional PWM model appears to be indeed insufficient to infer realistic sequence motifs, as it is on average outperformed by more complex models that take into account intra-motif dependencies. Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we find it worthwhile to recommend that any modern motif discovery algorithm should attempt to take into account intra-motif dependencies.

[1]  R Staden Computer methods to locate signals in nucleic acid sequences , 1984, Nucleic Acids Res..

[2]  Wyeth W. Wasserman,et al.  The Next Generation of Transcription Factor Binding Site Prediction , 2013, PLoS Comput. Biol..

[3]  G. Crawford,et al.  A User ' s Guide to the Encyclopedia of DNA Elements , 2011 .

[4]  Martin Vingron,et al.  On the Power of Profiles for Transcription Factor Binding Site Detection , 2003, Statistical applications in genetics and molecular biology.

[5]  G. Stormo,et al.  Improved Models for Transcription Factor Binding Site Identification Using Nonindependent Interactions , 2012, Genetics.

[6]  Jens Keilwagen,et al.  A general approach for discriminative de novo motif discovery from high-throughput data , 2013, GCB.

[7]  Daniel E. Newburger,et al.  Diversity and Complexity in DNA Recognition by Transcription Factors , 2009, Science.

[8]  G. Stormo,et al.  Additivity in protein-DNA interactions: how good an approximation is it? , 2002, Nucleic acids research.

[9]  Jens Keilwagen,et al.  De-Novo Discovery of Differentially Abundant Transcription Factor Binding Sites Including Their Positional Preference , 2011, PLoS Comput. Biol..

[10]  Terence P. Speed,et al.  Finding Short DNA Motifs Using Permuted Markov Models , 2005, J. Comput. Biol..

[11]  Raymond K. Auerbach,et al.  A User's Guide to the Encyclopedia of DNA Elements (ENCODE) , 2011, PLoS biology.

[12]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[13]  Matthew Slattery,et al.  Absence of a simple code: how transcription factors read the genome. , 2014, Trends in biochemical sciences.

[14]  Jianhua Lin,et al.  Divergence measures based on the Shannon entropy , 1991, IEEE Trans. Inf. Theory.

[15]  T. D. Schneider,et al.  Sequence logos: a new way to display consensus sequences. , 1990, Nucleic acids research.

[16]  J. Keilwagen,et al.  On the Value of Intra-Motif Dependencies of Human Insulator Protein CTCF , 2014, PloS one.

[17]  Victor V Lobanenkov,et al.  A genome-wide map of CTCF multivalency redefines the CTCF code. , 2013, Cell reports.

[18]  Nak-Kyeong Kim,et al.  Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites , 2008, BMC Bioinformatics.

[19]  Ivo Grosse,et al.  Robust learning of inhomogeneous PMMs , 2014, AISTATS.

[20]  Eric C. Rouchka,et al.  Gibbs Recursive Sampler: finding transcription factor binding sites , 2003, Nucleic Acids Res..

[21]  Tomi Silander,et al.  Locally Minimax Optimal Predictive Modeling with Bayesian Networks , 2009, AISTATS.

[22]  Jens Keilwagen,et al.  Varying levels of complexity in transcription factor binding motifs , 2015, Nucleic acids research.

[23]  Michael Q. Zhang,et al.  A weight array method for splicing signal analysis , 1993, Comput. Appl. Biosci..

[24]  Edgar Wingender,et al.  Inhomogeneous Parsimonious Markov Models , 2013, ECML/PKDD.

[25]  F. Wilcoxon Individual Comparisons by Ranking Methods , 1945 .

[26]  T. Hughes,et al.  Jury remains out on simple models of transcription factor specificity , 2011, Nature Biotechnology.

[27]  Jens Keilwagen,et al.  Jstacs: A Java Framework for Statistical Analysis and Classification of Biological Sequences , 2012, J. Mach. Learn. Res..

[28]  R. Siddharthan Dinucleotide Weight Matrices for Predicting Transcription Factor Binding Sites: Generalizing the Position Weight Matrix , 2010, PloS one.

[29]  Wyeth W. Wasserman,et al.  JASPAR: an open-access database for eukaryotic transcription factor binding profiles , 2004, Nucleic Acids Res..

[30]  Alexander E. Kel,et al.  TRANSFAC®: transcriptional regulation, from patterns to profiles , 2003, Nucleic Acids Res..

[31]  David Maxwell Chickering,et al.  Learning Bayesian Networks: The Combination of Knowledge and Statistical Data , 1994, Machine Learning.

[32]  S. Nielsen The stochastic EM algorithm: estimation and asymptotic results , 2000 .

[33]  A. Mortazavi,et al.  Genome-Wide Mapping of in Vivo Protein-DNA Interactions , 2007, Science.

[34]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[35]  Mikko Koivisto,et al.  Dealing with small data: On the generalization of context trees , 2015, ICML.

[36]  T. D. Schneider,et al.  Characterization of Translational Initiation Sites in E. Coui , 1982 .

[37]  Wilfred W. Li,et al.  MEME: discovering and analyzing DNA and protein sequence motifs , 2006, Nucleic Acids Res..

[38]  Kara Dolinski,et al.  Fungal BLAST and Model Organism BLASTP Best Hits: new comparison resources at the Saccharomyces Genome Database (SGD) , 2004, Nucleic Acids Res..

[39]  G. Stormo,et al.  Quantitative analysis demonstrates most transcription factors require only simple models of specificity , 2011, Nature Biotechnology.

[40]  William Stafford Noble,et al.  Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project , 2007, Nature.

[41]  Lin Yang,et al.  TFBSshape: a motif database for DNA shape features of transcription factor binding sites , 2013, Nucleic Acids Res..

[42]  Armin Shmilovici,et al.  Identification of transcription factor binding sites with variable-order Bayesian networks , 2005, Bioinform..

[43]  Chun-Hsi Huang,et al.  A survey of motif finding Web tools for detecting binding site motifs in ChIP-Seq data , 2014, Biology Direct.

[44]  A. A. Reilly,et al.  An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences , 1990, Proteins.

[45]  Qing Zhou,et al.  Modeling within-motif dependence for transcription factor binding site predictions , 2004, Bioinform..

[46]  Anirvan M. Sengupta,et al.  Non-additivity in protein-DNA binding , 2005, Bioinform..

[47]  Nir Friedman,et al.  Modeling dependencies in protein-DNA binding sites , 2003, RECOMB '03.

[48]  Hyunsoo Kim,et al.  Tree-Based Position Weight Matrix Approach to Model Transcription Factor Binding Site Profiles , 2011, PloS one.

[49]  Graziano Pesole,et al.  An algorithm for finding signals of unknown length in DNA sequences , 2001, ISMB.

[50]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.