ChIP-PaM: an algorithm to identify protein-DNA interaction using ChIP-Seq data

Background: ChIP-Seq is a powerful tool for identifying interactions between genomic regulators and the DNA they bind, especially for locating transcription factor binding sites. However, the high cost of the assay and the high false discovery rate among transcription factor binding sites identified from ChIP-Seq data significantly limit its application.

Results: Here we report a new algorithm, ChIP-PaM, for identifying transcription factor target regions in ChIP-Seq datasets. The algorithm makes full use of the protein-DNA binding pattern by capitalizing on three lines of evidence: 1) tag count modelling at the peak position, 2) pattern matching of a specific tag count distribution, and 3) motif searching along the genome. A novel data-based two-step eFDR procedure is proposed to integrate the three lines of evidence and determine significantly enriched regions. The algorithm requires no technical controls and efficiently discriminates falsely enriched regions from regions enriched by true transcription factor (TF) binding on the basis of ChIP-Seq data only. An analysis of real genomic data is presented to demonstrate the method.

Conclusions: In a comparison with existing methods, we found that our algorithm provides more accurate binding site discovery while maintaining comparable statistical power.
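To give a concrete feel for what integrating several lines of evidence per candidate region can look like, the sketch below combines three p-values with Fisher's method. This is an illustrative stand-in only: it is not the two-step eFDR procedure of ChIP-PaM, which the abstract does not specify. The function name `fisher_combine` and the example p-values (one each for peak tag count, pattern match, and motif) are hypothetical, and the method assumes the p-values are independent, which may not hold for real ChIP-Seq evidence.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Illustrative only: NOT the ChIP-PaM two-step eFDR procedure.
    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(p_values).
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    for p in p_values:
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2

    # Closed-form chi-square survival function for even degrees of freedom:
    # P(chi2_{2k} >= X) = exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


# Hypothetical p-values for one candidate region:
# peak tag count, pattern match, motif occurrence.
combined = fisher_combine([0.01, 0.02, 0.03])
print(combined)
```

A region with moderate support from all three sources receives a combined p-value smaller than any single one, which is the general motivation for pooling evidence before thresholding.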
