Discovering Transcription Factor Binding Sites in Highly Repetitive Regions of Genomes with Multi-Read Analysis of ChIP-Seq Data

Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is rapidly replacing chromatin immunoprecipitation combined with genome-wide tiling array analysis (ChIP-chip) as the preferred approach for mapping transcription factor binding sites and chromatin modifications. State-of-the-art ChIP-seq analysis uses only reads that map uniquely to the relevant reference genome (uni-reads), which can omit up to 30% of alignable reads. We describe a general approach for utilizing reads that map to multiple locations on the reference genome (multi-reads). Our approach allocates multi-reads as fractional counts using a weighted alignment scheme. Using human STAT1 and mouse GATA1 ChIP-seq datasets, we illustrate that incorporating multi-reads significantly increases sequencing depth, leads to detection of novel peaks that are not identifiable with uni-reads alone, and improves detection of peaks in mappable regions. Through computational experiments, we investigate various genome-wide characteristics of the peaks that are detected only when multi-reads are used. Overall, peaks from multi-read analysis have characteristics similar to those of peaks identified with uni-reads, except that the majority of them reside in segmental duplications. We further validate a number of GATA1 multi-read-only peaks by independent quantitative real-time ChIP analysis and identify novel target genes of GATA1. These computational and experimental results establish that multi-reads can be of critical importance for studying transcription factor binding in highly repetitive regions of genomes with ChIP-seq experiments.
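The idea of allocating multi-reads as fractional counts can be illustrated with a minimal sketch. The code below is not the weighting scheme of the paper, whose details the abstract does not give. It assumes a simple iterative rule: each multi-read starts with equal weight on all of its candidate positions, and the weights are then repeatedly rescaled in proportion to the read counts (uni-reads plus the current fractional multi-read counts) found within a window around each candidate. The function name `allocate_multireads`, the parameters `window`, `n_iter`, and `pseudo`, and the use of integer positions on a single chromosome are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def allocate_multireads(uni_positions, multi_reads, window=100, n_iter=20, pseudo=1e-6):
    """Assign fractional weights to the candidate positions of each multi-read.

    uni_positions: positions of uniquely mapped reads (each counts as 1).
    multi_reads:   one list of candidate positions per multi-read.
    Returns one list of weights per multi-read; each list sums to 1.
    """
    # Start from a uniform split: 1/K for a read with K candidate positions.
    weights = [[1.0 / len(cands)] * len(cands) for cands in multi_reads]

    for _ in range(n_iter):
        # Fractional read counts per position under the current weights.
        counts = defaultdict(float)
        for p in uni_positions:
            counts[p] += 1.0
        for cands, w in zip(multi_reads, weights):
            for p, wi in zip(cands, w):
                counts[p] += wi

        new_weights = []
        for cands, w in zip(multi_reads, weights):
            scores = []
            for p, wi in zip(cands, w):
                # Local support: all counts within `window` of the candidate,
                # excluding this read's own contribution at that candidate.
                local = sum(c for q, c in counts.items() if abs(q - p) <= window)
                scores.append(max(local - wi, 0.0) + pseudo)
            total = sum(scores)
            new_weights.append([s / total for s in scores])
        weights = new_weights

    return weights


if __name__ == "__main__":
    # Toy data: three uni-reads cluster near position 1000, one lies near 5000.
    uni = [1000, 1010, 1020, 5000]
    multi = [[1005, 5005], [1015, 9000]]
    for cands, w in zip(multi, allocate_multireads(uni, multi)):
        print(cands, [round(x, 3) for x in w])
```

In this toy setup, each multi-read ends up with most of its weight on the candidate that lies near the cluster of uni-reads. A peak caller could then treat the weights as fractional read counts, which is the general sense in which multi-reads contribute to sequencing depth.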
