On the detection and refinement of transcription factor binding sites using ChIP-Seq data

Coupling chromatin immunoprecipitation (ChIP) with recently developed massively parallel sequencing technologies has enabled genome-wide detection of protein–DNA interactions with unprecedented sensitivity and specificity. This new technology, ChIP-Seq, presents opportunities for in-depth analysis of transcription regulation. In this study, we explore the value of using ChIP-Seq data to better detect and refine transcription factor binding sites (TFBS). We introduce a novel computational algorithm named Hybrid Motif Sampler (HMS), specifically designed for TFBS motif discovery in ChIP-Seq data. We propose a Bayesian model that incorporates sequencing depth information to aid motif identification. Our model also allows intra-motif dependency to describe more accurately the underlying motif pattern. Our algorithm combines stochastic sampling and deterministic ‘greedy’ search steps into a novel hybrid iterative scheme. This combination accelerates the computation process. Simulation studies demonstrate favorable performance of HMS compared to other existing methods. When applying HMS to real ChIP-Seq datasets, we find that (i) the accuracy of existing TFBS motif patterns can be significantly improved; and (ii) there is significant intra-motif dependency inside all the TFBS motifs we tested; modeling these dependencies further improves the accuracy of these TFBS motif patterns. These findings may offer new biological insights into the mechanisms of transcription factor regulation.
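The abstract describes two ideas that a small example can make concrete: using sequencing depth as prior information about where a binding site lies, and alternating stochastic sampling with deterministic greedy updates. The following is a minimal toy sketch of those two ideas only, not the published HMS implementation. The function names, the depth prior (mean read depth over the candidate window), the uniform background, and the schedule of one greedy sweep every few iterations are all assumptions made for illustration. The sketch uses an independent-column motif model and does not model intra-motif dependency.

```python
import math
import random

BASES = "ACGT"

def pwm_from_sites(sites, width, pseudo=0.5):
    """Column-wise base frequencies (with pseudocounts) from aligned sites."""
    counts = [{b: pseudo for b in BASES} for _ in range(width)]
    for site in sites:
        for j, base in enumerate(site):
            counts[j][base] += 1.0
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def position_weights(seq, depth, pwm, width, background=0.25):
    """Unnormalised weight of each start position: depth prior times likelihood ratio."""
    weights = []
    for start in range(len(seq) - width + 1):
        log_ratio = sum(
            math.log(pwm[j][seq[start + j]] / background) for j in range(width)
        )
        # Assumed prior: proportional to the mean read depth over the window.
        prior = sum(depth[start:start + width]) / width + 1e-9
        weights.append(prior * math.exp(log_ratio))
    return weights

def hybrid_motif_search(seqs, depths, width, n_iter=50, greedy_every=5, seed=0):
    """Toy hybrid scheme: sampling sweeps, with a greedy sweep every few iterations."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for it in range(1, n_iter + 1):
        greedy = (it % greedy_every == 0)
        for i, seq in enumerate(seqs):
            # Build the motif model from all other sequences (leave sequence i out).
            others = [
                seqs[k][starts[k]:starts[k] + width]
                for k in range(len(seqs)) if k != i
            ]
            pwm = pwm_from_sites(others, width)
            w = position_weights(seq, depths[i], pwm, width)
            if greedy:
                # Deterministic step: take the highest-weight position.
                starts[i] = max(range(len(w)), key=w.__getitem__)
            else:
                # Stochastic step: sample a position in proportion to its weight.
                starts[i] = rng.choices(range(len(w)), weights=w)[0]
    return starts
```

A call such as `hybrid_motif_search(seqs, depths, width=6)` takes a list of DNA strings and a matching list of per-base depth values, and returns one motif start position per sequence. Because the last of the default 50 iterations is a greedy sweep, the returned positions are the highest-weight choices given the final motif model.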

Zhaohui S. Qin | Jeremy M. G. Taylor | A. Chinnaiyan | Jindan Yu | Ming Hu
