CNV-guided multi-read allocation for ChIP-seq

MOTIVATION: In chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) and other short-read sequencing experiments, a considerable fraction of the short reads align to multiple locations on the reference genome (multi-reads). Inferring the origin of multi-reads is critical for accurately mapping reads to repetitive regions. Current state-of-the-art multi-read allocation algorithms rely on the read counts in the local neighborhood of the alignment locations and ignore variation in the copy numbers of these regions. Copy-number variation (CNV) can directly affect read densities and can therefore bias the allocation of multi-reads.

RESULTS: We propose cnvCSEM (CNV-guided ChIP-Seq by expectation-maximization algorithm), a flexible framework that incorporates CNV in multi-read allocation. cnvCSEM eliminates the CNV bias in multi-read allocation by initializing the read allocation algorithm with CNV-aware initial values. Our data-driven simulations illustrate that cnvCSEM leads to higher read coverage with satisfactory accuracy and lower loss in read-depth recovery (estimation). We evaluate the biological relevance of the cnvCSEM-allocated reads and the resulting peaks through the analysis of several ENCODE ChIP-seq datasets.

AVAILABILITY AND IMPLEMENTATION: Available at http://www.stat.wisc.edu/~qizhang/

CONTACT: qizhang@stat.wisc.edu or keles@stat.wisc.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
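The idea of starting the allocation algorithm from CNV-aware initial values can be illustrated with a toy sketch. This is not the cnvCSEM implementation: the function name, the bin-level model, the use of a copy-number ratio (1.0 meaning no gain or loss), and the choice to divide unique-read counts by that ratio are all assumptions made here for illustration. The sketch runs a simple EM over multi-reads only, so the unique reads and the copy-number ratios enter solely through the starting weights.

```python
def allocate_multireads(unique_counts, copy_ratio, multireads,
                        n_iter=50, pseudocount=1.0):
    """Toy EM allocation of multi-reads to genomic bins (illustrative only).

    unique_counts : uniquely aligned read count per bin
    copy_ratio    : hypothetical copy-number ratio per bin (1.0 = no change)
    multireads    : list of candidate-bin lists, one per multi-read
    Returns one dict per multi-read mapping bin -> allocation probability.
    """
    n_bins = len(unique_counts)

    # Assumed CNV-aware start: scale the local unique-read signal by the
    # copy-number ratio so that amplified bins do not attract multi-reads
    # merely because they carry more DNA copies.
    weight = [(unique_counts[b] + pseudocount) / copy_ratio[b]
              for b in range(n_bins)]

    alloc = []
    for _ in range(n_iter):
        # E-step: split each multi-read across its candidate bins in
        # proportion to the current bin weights.
        alloc = []
        for cands in multireads:
            total = sum(weight[b] for b in cands)
            alloc.append({b: weight[b] / total for b in cands})

        # M-step: a bin's new weight is the multi-read mass allocated to it.
        weight = [0.0] * n_bins
        for a in alloc:
            for b, p in a.items():
                weight[b] += p
    return alloc


# One multi-read that aligns to bin 0 (amplified fourfold) and bin 1.
cnv_aware = allocate_multireads([20, 10], [4.0, 1.0], [[0, 1]])
cnv_blind = allocate_multireads([20, 10], [1.0, 1.0], [[0, 1]])
```

In this example the CNV-blind start favors the amplified bin 0 because of its higher raw count, while the CNV-aware start favors bin 1. Because this toy likelihood has many optima, the starting values determine where the EM iterations settle, which is why the initialization is the natural place to inject copy-number information.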
