Heterogeneity in DNA Multiple Alignments: Modeling, Inference, and Applications in Motif Finding

Transcription factors bind sequence-specific sites in DNA to regulate gene transcription. Identifying transcription factor binding sites (TFBSs) is an important step toward understanding gene regulation. Computational methods for TFBS detection and motif finding model TFBSs and their combinatorial patterns in sophisticated ways, yet they often make oversimplified assumptions of homogeneity for background sequences. Because nucleotide base composition varies across genomic regions, incorporating this heterogeneity into background modeling is expected to help motif finding. When sequences from multiple species are used, variation in evolutionary conservation violates the common assumption of an identical conservation level across a multiple alignment. To handle both types of heterogeneity, we propose a generative model in which a segmented Markov chain partitions a multiple alignment into regions of homogeneous nucleotide base composition and a hidden Markov model (HMM) accounts for different conservation levels. Bayesian inference on the model is developed via Gibbs sampling with dynamic programming recursions. Simulation studies and empirical evidence from biological data sets show that background modeling strongly affects motif finding, and demonstrate that the proposed approach achieves substantial improvements over commonly used background models.
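
To make the phrase "Gibbs sampling with dynamic programming recursions" concrete, the following is a minimal sketch of one standard building block: forward filtering followed by backward sampling of a hidden state path in an HMM. It is not the authors' implementation. The two conservation states, the reduction of each alignment column to "all species agree" (1) versus "some species differ" (0), the function name `sample_state_path`, and every probability below are assumptions chosen only for illustration.

```python
import random


def sample_state_path(obs, init, trans, emit, rng):
    """Draw one hidden-state path from its posterior given the observations.

    obs        : list of observed symbol indices, one per alignment column
    init[k]    : prior probability of state k at the first column
    trans[j][k]: P(state k at column t+1 | state j at column t)
    emit[k][o] : P(symbol o | state k)
    rng        : a random.Random instance
    """
    n, num_states = len(obs), len(init)
    if n == 0:
        return []

    # Forward recursion: alpha[t][k] = P(state_t = k | obs[0..t]),
    # renormalized at every column to avoid numerical underflow.
    alpha = []
    for t in range(n):
        if t == 0:
            row = [init[k] * emit[k][obs[0]] for k in range(num_states)]
        else:
            prev = alpha[-1]
            row = [
                emit[k][obs[t]]
                * sum(prev[j] * trans[j][k] for j in range(num_states))
                for k in range(num_states)
            ]
        total = sum(row)
        alpha.append([x / total for x in row])

    # Backward sampling: draw the last state from alpha[n-1], then each
    # earlier state conditional on the state already drawn to its right.
    path = [0] * n
    path[-1] = rng.choices(range(num_states), weights=alpha[-1])[0]
    for t in range(n - 2, -1, -1):
        weights = [alpha[t][j] * trans[j][path[t + 1]] for j in range(num_states)]
        path[t] = rng.choices(range(num_states), weights=weights)[0]
    return path


# Hypothetical setup: state 0 = weakly conserved, state 1 = highly conserved.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.6, 0.4], [0.1, 0.9]]  # emit[state][symbol]
columns = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]  # 1 = all species agree at this column

print(sample_state_path(columns, init, trans, emit, random.Random(0)))
```

Within a Gibbs sampler, a draw of this kind would be one step in a sweep that alternates with updates of the other unknowns (for example segment boundaries, background parameters, and motif sites). Sampling the whole path jointly, rather than one column at a time, is what the dynamic programming recursion buys.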
