A strand specific high resolution normalization method for chip-sequencing data employing multiple experimental control measurements

BackgroundHigh-throughput sequencing is becoming the standard tool for investigating protein-DNA interactions or epigenetic modifications. However, the data generated will always contain noise due to e.g. repetitive regions or non-specific antibody interactions. The noise will appear in the form of a background distribution of reads that must be taken into account in the downstream analysis, for example when detecting enriched regions (peak-calling). Several reported peak-callers can take experimental measurements of background tag distribution into account when analysing a data set. Unfortunately, the background is only used to adjust peak calling and not as a pre-processing step that aims at discerning the signal from the background noise. A normalization procedure that extracts the signal of interest would be of universal use when investigating genomic patterns.ResultsWe formulated such a normalization method based on linear regression and made a proof-of-concept implementation in R and C++. It was tested on simulated as well as on publicly available ChIP-seq data on binding sites for two transcription factors, MAX and FOXA1 and two control samples, Input and IgG. We applied three different peak-callers to (i) raw (un-normalized) data using statistical background models and (ii) raw data with control samples as background and (iii) normalized data without additional control samples as background. The fraction of called regions containing the expected transcription factor binding motif was largest for the normalized data and evaluation with qPCR data for FOXA1 suggested higher sensitivity and specificity using normalized data over raw data with experimental background.ConclusionsThe proposed method can handle several control samples allowing for correction of multiple sources of bias simultaneously. Our evaluation on both synthetic and experimental data suggests that the method is successful in removing background noise.

[1]  Carito Guziolowski,et al.  Algorithms for Molecular Biology , 2007 .

[2]  Steven J. M. Jones,et al.  FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology , 2008, Bioinform..

[3]  B. Widrow,et al.  Adaptive noise cancelling: Principles and applications , 1975 .

[4]  Jan Komorowski,et al.  SICTIN: Rapid footprinting of massively parallel sequencing data , 2010, BioData Mining.

[5]  Allen D. Delaney,et al.  Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing , 2007, Nature Methods.

[6]  Clifford A. Meyer,et al.  Model-based Analysis of ChIP-Seq (MACS) , 2008, Genome Biology.

[7]  William Stafford Noble,et al.  Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project , 2007, Nature.

[8]  M. E. Galassi,et al.  GNU SCIENTI C LIBRARY REFERENCE MANUAL , 2005 .

[9]  B. Lüscher,et al.  Function and regulation of the transcription factors of the Myc/Max/Mad network. , 2001, Gene.

[10]  Steven J. M. Jones,et al.  Genome-wide identification of DNA-protein interactions using chromatin immunoprecipitation coupled with flow cell sequencing. , 2009, The Journal of endocrinology.

[11]  Aaron R. Quinlan,et al.  Bioinformatics Applications Note Genome Analysis Bedtools: a Flexible Suite of Utilities for Comparing Genomic Features , 2022 .

[12]  T. Laajala,et al.  A practical comparison of methods for detecting transcription factor binding sites in ChIP-seq experiments , 2009, BMC Genomics.

[13]  Jan Komorowski,et al.  Differential binding and co-binding pattern of FOXA1 and FOXA3 and their relation to H3K4me3 in HepG2 cells revealed by ChIP-seq , 2009, Genome Biology.

[14]  Simon Anders,et al.  Visualisation of genomic data with the Hilbert curve , 2009 .

[15]  Pearlly Yan,et al.  Comparative study on ChIP-seq data: normalization and binding pattern characterization , 2009, Bioinform..

[16]  A. Mortazavi,et al.  Genome-Wide Mapping of in Vivo Protein-DNA Interactions , 2007, Science.

[17]  J. Bartůňková,et al.  Department of Immunology , 2009 .

[18]  Raja Jothi,et al.  Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data , 2008, Nucleic acids research.

[19]  P. Park,et al.  Design and analysis of ChIP-seq experiments for DNA-binding proteins , 2008, Nature Biotechnology.