Discriminative motif analysis of high-throughput dataset

MOTIVATION High-throughput ChIP-seq studies typically identify thousands of peaks for a single transcription factor (TF). Traditional motif discovery tools often predict motifs that are statistically significant against a naïve background distribution but are of questionable biological relevance.

RESULTS We describe a simple yet effective algorithm for discovering differential motifs between two sequence datasets that eliminates systematic biases and scales to large datasets. Tested on 207 ENCODE ChIP-seq datasets, our method identifies correct motifs in 78% of the datasets with known motifs, improving on both the accuracy and the efficiency of DREME, another state-of-the-art discriminative motif discovery tool. More interestingly, on the remaining, more challenging datasets, we identify common technical or biological factors that compromise the motif search results and use advanced features of our tool to control for these factors. We also present case studies demonstrating the ability of our method to detect single-base-pair differences in the DNA specificity of two similar TFs. Lastly, we demonstrate the discovery of key TF motifs involved in tissue specification by examining high-throughput DNase accessibility data.

AVAILABILITY The motifRG package is publicly available via the Bioconductor repository.

CONTACT yzizhen@fhcrc.org

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
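The core idea of discriminative motif analysis, comparing how often a pattern occurs in a foreground set against a matched background set rather than against a naïve base-composition model, can be illustrated with a minimal Python sketch. This is not the motifRG algorithm or its API: the function names (`canonical`, `kmer_presence`, `rank_discriminative_kmers`), the fixed k-mer length, and the pseudocounted log odds ratio used as the score are all assumptions chosen for illustration.

```python
from collections import Counter
from math import log

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_presence(seqs, k):
    """For each canonical k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        seen = set()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                seen.add(canonical(kmer))
        counts.update(seen)
    return counts


def rank_discriminative_kmers(foreground, background, k=6, pseudocount=0.5):
    """Rank k-mers by a pseudocounted log odds ratio of presence (foreground vs background).

    Illustrative scoring only; a real tool would use a proper statistical
    model and would refine and extend the seed patterns.
    """
    fg = kmer_presence(foreground, k)
    bg = kmer_presence(background, k)
    n_fg, n_bg = len(foreground), len(background)
    scores = {}
    for kmer in set(fg) | set(bg):
        a = fg[kmer] + pseudocount            # foreground sequences with the k-mer
        b = n_fg - fg[kmer] + pseudocount     # foreground sequences without it
        c = bg[kmer] + pseudocount            # background sequences with the k-mer
        d = n_bg - bg[kmer] + pseudocount     # background sequences without it
        scores[kmer] = log((a * d) / (b * c))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data: every foreground sequence carries the E-box CAGCTG; no background sequence does.
    foreground = ["TTACAGCTGAAT", "GGCCAGCTGTCA", "ATTCAGCTGCGA", "CGACAGCTGTTT"]
    background = ["TTACAGGTGAAT", "GGCCACCTGTCA", "ATTCATATGCGA", "CGACAAATGTTT"]
    for kmer, score in rank_discriminative_kmers(foreground, background)[:3]:
        print(kmer, round(score, 2))
```

Because the score depends on the background set, choosing a background that shares the foreground's systematic biases (for example GC content or repeat composition) is what lets a discriminative comparison suppress patterns that reflect those biases rather than TF binding.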
