Inference of binding sites with a Bayesian multiple-instance motif discovery method

We present a Bayesian motif discovery (BMD) algorithm for detecting an unknown number of instances of a motif in a given set of sequences. The algorithm models a motif with a position weight matrix (PWM), which is estimated jointly with the motif instances during discovery. The technique is flexible enough to accept the results of other discovery algorithms as input. The method is based on a sequential Monte Carlo algorithm in which the state to be estimated consists of the number of instances in each sequence and their starting positions. The accuracy of the proposed method is compared with that of other profile-based discovery algorithms. BMD is shown to perform statistically better than MEME and BioProspector in applications ranging from synthetic data to genomic motif finding for Din serine recombinases. In the case of site-specific recombinase target discovery, the BMD-inferred motif is found to be the only one that is functionally accurate with respect to the underlying biochemical mechanism.
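
To make the PWM representation concrete, the following is a minimal sketch of how a position weight matrix can be estimated from aligned motif instances and used to score candidate start positions in a sequence. It is not the BMD algorithm itself: the sequential Monte Carlo inference over instance counts and positions is omitted, and the function names (`build_pwm`, `log_odds`, `best_site`), the pseudocount of 1.0, and the uniform background of 0.25 are all assumptions chosen for illustration.

```python
import math

ALPHABET = "ACGT"


def build_pwm(instances, pseudocount=1.0):
    """Estimate per-column base probabilities from aligned, equal-length instances.

    The pseudocount (an illustrative choice) keeps every probability nonzero.
    """
    width = len(instances[0])
    pwm = []
    for j in range(width):
        counts = {base: pseudocount for base in ALPHABET}
        for inst in instances:
            counts[inst[j]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in ALPHABET})
    return pwm


def log_odds(pwm, window, background=0.25):
    """Log-odds score of a window under the PWM versus a uniform background (assumed)."""
    return sum(math.log(col[base] / background) for col, base in zip(pwm, window))


def best_site(pwm, sequence):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(pwm)
    scores = [
        log_odds(pwm, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    ]
    start = max(range(len(scores)), key=scores.__getitem__)
    return start, scores[start]


# Hypothetical toy data: three aligned instances and one sequence to scan.
pwm = build_pwm(["TTGACA", "TTGACA", "TTGACT"])
start, score = best_site(pwm, "CCCCTTGACAGGGG")
```

A scan like this reports a single best site per sequence. The method described in the abstract instead treats the number of instances in each sequence as unknown and estimates it together with the starting positions and the PWM.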
