RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets

ChIP-seq is increasingly used to characterize transcription factor binding and chromatin marks at a genomic scale. Various tools are now available to extract binding motifs from peak data sets. However, most approaches are only available as command-line programs, or via a website but with size restrictions. We present peak-motifs, a computational pipeline that discovers motifs in peak sequences, compares them with databases, exports putative binding sites for visualization in the UCSC genome browser and generates an extensive report suited for both naive and expert users. It relies on time- and memory-efficient algorithms enabling the treatment of several thousand peaks within minutes. Regarding time efficiency, peak-motifs outperforms all comparable tools by several orders of magnitude. We demonstrate its accuracy by analyzing data sets ranging from 4000 to 128 000 peaks for 12 embryonic stem cell-specific transcription factors. In all cases, the program finds the expected motifs and returns additional motifs potentially bound by cofactors. We further apply peak-motifs to discover tissue-specific motifs in peak collections for the p300 transcriptional co-activator. To our knowledge, peak-motifs is the only tool that performs a complete motif analysis and offers a user-friendly web interface without any restriction on sequence size or number of peaks.
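As an illustration of the export step mentioned above, the following minimal sketch shows how a putative binding site located relative to its peak could be converted to genomic coordinates and written as a BED track that the UCSC genome browser can load. It is not the peak-motifs implementation: the `Site` record, the function names, and the score handling are hypothetical, and the sketch assumes that peak starts are 0-based as in the BED format.

```python
from typing import Iterable, NamedTuple, Tuple


class Site(NamedTuple):
    """Hypothetical record: a putative site located relative to its peak."""
    offset: int   # 0-based offset of the site within the peak sequence
    length: int   # site length in bp
    strand: str   # "+" or "-"
    motif: str    # name of the matching motif
    score: float  # matching score (assumed scale, for illustration only)


def site_to_bed(chrom: str, peak_start: int, site: Site) -> str:
    """Convert a peak-relative site to one BED6 line (0-based, half-open)."""
    start = peak_start + site.offset
    end = start + site.length
    # The BED score field is an integer in 0..1000, so round and clamp.
    bed_score = max(0, min(1000, round(site.score)))
    fields = [chrom, str(start), str(end), site.motif, str(bed_score), site.strand]
    return "\t".join(fields)


def sites_to_bed_track(
    track_name: str, records: Iterable[Tuple[str, int, Site]]
) -> str:
    """Build a BED custom track: one track line, then one line per site."""
    lines = [f'track name="{track_name}" visibility=pack']
    lines += [site_to_bed(chrom, peak_start, site) for chrom, peak_start, site in records]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up example: a site 25 bp into a peak starting at chr1:1000.
    demo = [("chr1", 1000, Site(25, 8, "+", "motif_1", 12.4))]
    print(sites_to_bed_track("demo_sites", demo), end="")
```

The key design point is the coordinate shift: site positions found by scanning peak sequences are relative to each peak, so they must be offset by the peak's genomic start before they can be displayed alongside other genome annotations.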
