Reusable, extensible, and modifiable R scripts and Kepler workflows for comprehensive single set ChIP-seq analysis

Background: Use of chromatin immunoprecipitation followed by sequencing (ChIP-seq) has expanded enormously. Analysis of large-scale ChIP-seq datasets involves a complex series of steps and the production of several specialized graphical outputs. A number of systems have emphasized custom development of ChIP-seq pipelines. These systems either rely on custom programming of a single, complex pipeline or supply libraries of modules, and they do not produce the full range of outputs commonly generated for ChIP-seq datasets. More comprehensive pipelines are desirable, in particular ones that address common metadata tasks, such as pathway analysis, and ones that produce standard complex graphical outputs. Such systems are most useful when they are highly modular, available as both turnkey pipelines and individual modules, and easy to understand, modify, and extend, so that they can be altered rapidly in response to new analysis developments in this growing area. It is also advantageous if these pipelines allow data provenance tracking.

Results: We present a set of 20 ChIP-seq analysis software modules implemented in the Kepler workflow system; most (18/20) were also implemented as standalone, fully functional R scripts. The set consists of four full turnkey pipelines and 16 component modules. The turnkey pipelines in Kepler allow data provenance tracking. Implementation emphasized use of common R packages and widely used external tools (e.g., MACS for peak finding), along with custom programming. This software provides comprehensive solutions and easily repurposed code blocks for ChIP-seq analysis and pipeline creation. Tasks include mapping raw reads, peak finding via MACS, summary statistics, peak location statistics, summary plots centered on the transcription start site (TSS), gene ontology analysis, pathway analysis, and de novo motif finding, among others.

Conclusions: These pipelines range from those performing a single task to those performing full analyses of ChIP-seq data. The pipelines are supplied as Kepler workflows, which allow data provenance tracking, and, in the majority of cases, also as standalone R scripts. They are designed for ease of modification and repurposing.
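
To make the "peak location statistics" and "summary plots centered on the TSS" tasks concrete, the following is a minimal Python sketch of one way such a statistic could be computed. It is not the authors' code (their modules are R scripts and Kepler workflows). The function names, the 1000 bp bin size, the 3000 bp window, and the simplification of ignoring gene strand are all assumptions made for illustration.

```python
from bisect import bisect_left
from collections import Counter


def nearest_tss_distance(summit, tss_sorted):
    """Signed distance (bp) from a peak summit to the closest TSS.

    tss_sorted must be a non-empty, ascending list of TSS positions on the
    same chromosome. Strand is ignored here (an illustrative simplification),
    so a negative value only means "lower coordinate than the TSS".
    """
    i = bisect_left(tss_sorted, summit)
    # Only the TSS just before and just after the summit can be the nearest.
    candidates = tss_sorted[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda t: abs(summit - t))
    return summit - nearest


def tss_profile(summits, tss_by_chrom, window=3000, bin_size=1000):
    """Count peak summits per distance bin around the nearest TSS.

    summits: iterable of (chromosome, position) pairs.
    tss_by_chrom: dict mapping chromosome to an ascending list of TSS positions.
    Returns a dict {bin_start: count} ordered by bin_start; summits farther
    than `window` bp from any TSS are left out.
    """
    counts = Counter()
    for chrom, pos in summits:
        tss_sorted = tss_by_chrom.get(chrom)
        if not tss_sorted:
            continue  # no annotated TSS on this chromosome
        d = nearest_tss_distance(pos, tss_sorted)
        if abs(d) <= window:
            counts[(d // bin_size) * bin_size] += 1
    return dict(sorted(counts.items()))


# Hypothetical toy data: two TSSs on chr1 and four peak summits.
example_tss = {"chr1": [1000, 5000]}
example_summits = [("chr1", 1200), ("chr1", 4900), ("chr1", 9000), ("chr2", 10)]
example_profile = tss_profile(example_summits, example_tss)
```

The resulting bin counts are the kind of data that a TSS-centered summary plot displays; in practice the summit positions would come from a peak caller's output and the TSS positions from a gene annotation.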
