Paper information - Discovering Transcription Factor Binding Sites in Highly Repetitive Regions of Genomes with Multi-Read Analysis of ChIP-Seq Data

Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is rapidly replacing chromatin immunoprecipitation combined with genome-wide tiling array analysis (ChIP-chip) as the preferred approach for mapping transcription-factor binding sites and chromatin modifications. The state of the art for analyzing ChIP-seq data relies on using only reads that map uniquely to a relevant reference genome (uni-reads). This can lead to the omission of up to 30% of alignable reads. We describe a general approach for utilizing reads that map to multiple locations on the reference genome (multi-reads). Our approach is based on allocating multi-reads as fractional counts using a weighted alignment scheme. Using human STAT1 and mouse GATA1 ChIP-seq datasets, we illustrate that incorporation of multi-reads significantly increases sequencing depths, leads to detection of novel peaks that are not otherwise identifiable with uni-reads, and improves detection of peaks in mappable regions. We investigate various genome-wide characteristics of peaks detected only by utilization of multi-reads via computational experiments. Overall, peaks from multi-read analysis have similar characteristics to peaks that are identified by uni-reads except that the majority of them reside in segmental duplications. We further validate a number of GATA1 multi-read only peaks by independent quantitative real-time ChIP analysis and identify novel target genes of GATA1. These computational and experimental results establish that multi-reads can be of critical importance for studying transcription factor binding in highly repetitive regions of genomes with ChIP-seq experiments.
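The abstract does not spell out the weighting scheme, so the following is only a minimal sketch of what fractional allocation of multi-reads could look like. It assumes, purely for illustration, that each multi-read is split across its candidate locations in proportion to the uni-read counts already observed there, plus a pseudo-count. The function name `allocate_multi_reads`, the bin labels, and the `pseudo` parameter are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def allocate_multi_reads(uni_counts, multi_reads, pseudo=1.0):
    """Distribute each multi-read across its candidate bins as fractional counts.

    uni_counts  : dict mapping a bin label to its uni-read count
    multi_reads : list of lists; each inner list holds the candidate bins
                  to which one multi-read aligns
    pseudo      : pseudo-count (assumed here) so that bins without any
                  uni-reads still receive a non-zero share
    """
    counts = defaultdict(float)
    for bin_label, count in uni_counts.items():
        counts[bin_label] += float(count)

    for candidates in multi_reads:
        # Weight each candidate bin by its uni-read support (illustrative choice).
        weights = [uni_counts.get(b, 0) + pseudo for b in candidates]
        total = sum(weights)
        for bin_label, weight in zip(candidates, weights):
            # The fractions for one read sum to one, so sequencing depth is preserved.
            counts[bin_label] += weight / total

    return dict(counts)


# Toy input: one multi-read aligning to two bins with different uni-read support.
uni = {"chr1:100": 3, "chr2:500": 0}
multi = [["chr1:100", "chr2:500"]]
allocated = allocate_multi_reads(uni, multi)
```

In this toy input the single multi-read adds exactly one count in total, most of it to the bin with more uni-read support. A peak caller would then operate on the fractional counts rather than on uni-read counts alone.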
