Shape-based peak identification for ChIP-Seq

Background: The identification of binding targets for proteins using ChIP-Seq has gained popularity as an alternative to ChIP-chip. Sequencing can, in principle, eliminate artifacts associated with microarrays, and cheap sequencing offers the ability to sequence deeply and obtain a comprehensive survey of binding. A number of algorithms have been developed to call "peaks" representing bound regions from mapped reads. Most current algorithms incorporate multiple heuristics, and despite much work it remains difficult to accurately determine individual peaks corresponding to distinct binding events.

Results: Our method for identifying statistically significant peaks from read coverage is inspired by the notion of persistence in topological data analysis and provides a non-parametric approach that is statistically sound and robust to noise in experiments. Specifically, our method reduces the peak calling problem to the study of tree-based statistics derived from the data. We validate our approach using previously published data and show that it can discover previously missed regions.

Conclusions: The difficulty in accurately calling peaks for ChIP-Seq data is partly due to the difficulty in defining peaks, and we demonstrate a novel method that improves on the accuracy of previous methods in resolving peaks. Our introduction of a robust statistical test based on ideas from topological data analysis is also novel. Our methods are implemented in a program called T-PIC (Tree shape Peak Identification for ChIP-Seq), available at http://bio.math.berkeley.edu/tpic/.
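To make the notion of persistence mentioned in the abstract concrete, the following is a minimal sketch of zero-dimensional persistence applied to a one-dimensional read-coverage profile. It is not the T-PIC implementation and does not reproduce its tree-shape statistics or significance test; the function name `peak_persistence` and the toy coverage vector are assumptions made purely for illustration. Each local maximum is "born" at its own height and "dies" at the height where its region merges into one containing a taller peak, so persistence (birth minus death) measures how prominent a peak is.

```python
def peak_persistence(coverage):
    """Return (peak_index, birth, death) triples for a 1-D coverage profile.

    Illustrative sketch only (hypothetical helper, not part of T-PIC).
    Positions are swept from highest to lowest coverage; connected regions
    are tracked with union-find. When two regions meet, the one with the
    lower peak dies at the current height. Results are sorted by
    persistence (birth - death), largest first.
    """
    n = len(coverage)
    if n == 0:
        return []

    order = sorted(range(n), key=lambda i: (-coverage[i], i))
    parent = [None] * n  # None marks positions not yet swept
    peak = {}            # region root -> index of that region's highest point

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    results = []
    for i in order:
        parent[i] = i
        peak[i] = i
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Keep the region with the taller peak; the other one dies here.
                if coverage[peak[ri]] >= coverage[peak[rj]]:
                    keep, drop = ri, rj
                else:
                    keep, drop = rj, ri
                died = peak[drop]
                if coverage[died] > coverage[i]:  # skip zero-persistence merges
                    results.append((died, coverage[died], coverage[i]))
                parent[drop] = keep

    # The global maximum never merges into a taller region; by convention
    # its death is set to the minimum coverage.
    top = peak[find(order[0])]
    results.append((top, coverage[top], min(coverage)))

    results.sort(key=lambda t: (-(t[1] - t[2]), t[0]))
    return results


# Toy coverage vector (made up for illustration).
toy = [0, 2, 5, 3, 4, 1, 0, 1, 0]
print(peak_persistence(toy))
# → [(2, 5, 0), (4, 4, 3), (7, 1, 0)]
```

In this toy profile the peak at position 2 has persistence 5, while the shoulder at position 4 and the small bump at position 7 each have persistence 1. Ranking by persistence alone is a heuristic; the paper's contribution is a statistical test built on tree-based statistics, which this sketch does not attempt.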
