Stochastic EM-based TFBS motif discovery with MITSU

Motivation: The Expectation–Maximization (EM) algorithm has been successfully applied to the problem of transcription factor binding site (TFBS) motif discovery and underlies the most widely used motif discovery algorithms. In the wider field of probabilistic modelling, the stochastic EM (sEM) algorithm has been used to overcome some of the limitations of the EM algorithm; however, the application of sEM to motif discovery has not been fully explored.

Results: We present MITSU (Motif discovery by ITerative Sampling and Updating), a novel algorithm for motif discovery that combines sEM with an improved approximation to the likelihood function; this approximation is unconstrained with regard to the distribution of motif occurrences within the input dataset. The algorithm is evaluated quantitatively on realistic synthetic data and on several collections of characterized prokaryotic TFBS motifs, and is shown to outperform EM and an alternative sEM-based algorithm, particularly in terms of site-level positive predictive value.

Availability and implementation: Java executable available for download at http://www.sourceforge.net/p/mitsu-motif/, supported on Linux/OS X.

Contact: a.m.kilpatrick@sms.ed.ac.uk
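To make the difference between EM and sEM concrete, the following is a minimal Python sketch of a generic stochastic EM loop for motif discovery. It is not MITSU and is not taken from the paper: it assumes exactly one motif occurrence per sequence (a constraint that MITSU's likelihood approximation is stated to avoid), a uniform background distribution, and a fixed motif width. All function names, parameters, and the pseudocount value are hypothetical choices made for illustration. The only point it demonstrates is the defining sEM step: the expectation over hidden motif positions is replaced by a single sample drawn from their posterior, after which the motif model is re-estimated from the sampled sites.

```python
import random

ALPHABET = "ACGT"


def site_posterior(seq, pwm, background, width):
    """Posterior over motif start positions in one sequence.

    Assumes exactly one site per sequence (an illustrative simplification).
    Each start is weighted by its motif-to-background likelihood ratio.
    """
    weights = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for offset in range(width):
            base = seq[start + offset]
            ratio *= pwm[offset][base] / background[base]
        weights.append(ratio)
    total = sum(weights)
    return [w / total for w in weights]


def update_pwm(seqs, starts, width, pseudocount=0.5):
    """M-step: re-estimate the position weight matrix from the sampled sites."""
    pwm = []
    for offset in range(width):
        counts = {b: pseudocount for b in ALPHABET}  # hypothetical pseudocount
        for seq, start in zip(seqs, starts):
            counts[seq[start + offset]] += 1.0
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in ALPHABET})
    return pwm


def stochastic_em(seqs, width, iterations=50, seed=0):
    """Generic sEM loop: sample hidden site positions, then update the model."""
    rng = random.Random(seed)
    background = {b: 0.25 for b in ALPHABET}  # assumed uniform background
    # Initialise from randomly chosen sites.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    pwm = update_pwm(seqs, starts, width)
    for _ in range(iterations):
        # Stochastic E-step: draw one start per sequence from its posterior,
        # instead of carrying the full posterior forward as deterministic EM does.
        starts = [
            rng.choices(
                range(len(s) - width + 1),
                weights=site_posterior(s, pwm, background, width),
            )[0]
            for s in seqs
        ]
        # M-step: re-estimate the motif model from the sampled sites only.
        pwm = update_pwm(seqs, starts, width)
    return pwm, starts
```

Calling `stochastic_em(["ACGTACGTTTGACA", "TTGACAGGGCCCAA", "CCCTTGACATTTGG"], width=6)` returns a position weight matrix and one sampled start position per sequence. Because the site positions are sampled, results vary with the seed, and a practical implementation would typically aggregate over iterations or restarts rather than trust the final draw.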
