A Sequential Monte Carlo Method for Motif Discovery

We propose a sequential Monte Carlo (SMC) motif discovery algorithm that can efficiently detect motifs in datasets containing a large number of sequences. The statistical distribution of the motifs is modeled by an underlying position weight matrix (PWM), and the SMC algorithm estimates both the PWM and the positions of the motifs within the sequences. The proposed technique can locate motifs under several scenarios: the single-block model, the two-block model with unknown gap length, motifs of unknown length, motifs of unknown abundance, and sequences containing multiple distinct motifs. Its accuracy is shown to be superior to that of existing methods based on Markov chain Monte Carlo (MCMC) or expectation-maximization (EM) algorithms. Furthermore, it is shown that the proposed method can improve the results of existing motif discovery algorithms by using their outputs as priors for the SMC algorithm.
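To make the single-block setting concrete, the following is a minimal illustrative sketch and not the algorithm of this paper. It assumes one motif occurrence per sequence, a fixed known motif width, a uniform background, and a symmetric pseudocount prior on the PWM. The names `smc_motif_search`, `segment_ratios`, and `pwm_from_counts` are hypothetical. Each particle carries a count table from which a PWM is derived, samples a motif start position for each new sequence, and is reweighted and occasionally resampled.

```python
import random

ALPHABET = "ACGT"
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def pwm_from_counts(counts):
    """Normalize each row of a (width x 4) count table into probabilities."""
    return [[c / sum(row) for c in row] for row in counts]


def segment_ratios(seq, pwm, background=0.25):
    """Likelihood ratio (motif vs. uniform background) for every start position."""
    width = len(pwm)
    ratios = []
    for start in range(len(seq) - width + 1):
        r = 1.0
        for j in range(width):
            r *= pwm[j][INDEX[seq[start + j]]] / background
        ratios.append(r)
    return ratios


def smc_motif_search(seqs, width, n_particles=200, pseudocount=0.5, seed=0):
    """Toy sequential importance sampling with resampling (assumed setup)."""
    rng = random.Random(seed)
    # Each particle: pseudocount-initialized counts plus the sampled start positions.
    particles = [
        {"counts": [[pseudocount] * 4 for _ in range(width)], "starts": []}
        for _ in range(n_particles)
    ]
    weights = [1.0 / n_particles] * n_particles

    for seq in seqs:
        for k, p in enumerate(particles):
            ratios = segment_ratios(seq, pwm_from_counts(p["counts"]))
            # Sample the start position in proportion to its likelihood ratio.
            start = rng.choices(range(len(ratios)), weights=ratios)[0]
            # Incremental weight: average ratio over all candidate positions.
            weights[k] *= sum(ratios) / len(ratios)
            p["starts"].append(start)
            for j in range(width):
                p["counts"][j][INDEX[seq[start + j]]] += 1.0

        total = sum(weights)
        weights = [wt / total for wt in weights]

        # Resample when the effective sample size drops below half the particles.
        ess = 1.0 / sum(wt * wt for wt in weights)
        if ess < n_particles / 2:
            chosen = rng.choices(particles, weights=weights, k=n_particles)
            particles = [
                {"counts": [row[:] for row in p["counts"]], "starts": p["starts"][:]}
                for p in chosen
            ]
            weights = [1.0 / n_particles] * n_particles

    best = max(range(n_particles), key=lambda k: weights[k])
    return particles[best]["starts"], pwm_from_counts(particles[best]["counts"])
```

Calling `smc_motif_search(["GGTATAATCC", "ACTATAATGG"], width=6)` returns one sampled start position per sequence together with the estimated PWM of the highest-weight particle. The other scenarios listed above (unknown gap length, unknown motif length or abundance, multiple motifs) would need additional state per particle and are not covered by this sketch.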
