Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymers

The algorithm described in this paper discovers one or more motifs in a collection of DNA or protein sequences by using the technique of expectation maximization to fit a two-component finite mixture model to the set of sequences. Multiple motifs are found by fitting a mixture model to the data, probabilistically erasing the occurrences of the motif thus found, and repeating the process to find successive motifs. The algorithm requires only a set of unaligned sequences and a number specifying the width of the motifs as input. It returns a model of each motif and a threshold which together can be used as a Bayes-optimal classifier for searching for occurrences of the motif in other databases. The algorithm estimates how many times each motif occurs in each sequence in the dataset and outputs an alignment of the occurrences of the motif. The algorithm is capable of discovering several different motifs with differing numbers of occurrences in a single dataset.
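The core fitting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a DNA alphabet, treats every overlapping window of the given width as an independent sample, keeps the background distribution fixed at the overall letter frequencies, and seeds the motif from one randomly chosen window. The function name `fit_two_component_mixture` and the parameters `n_iter`, `pseudo`, and `seed` are hypothetical. Probabilistic erasing and the search for successive motifs are omitted.

```python
import math
import random

ALPHABET = "ACGT"  # assumption: DNA; use the 20 amino acids for protein


def fit_two_component_mixture(seqs, width, n_iter=50, pseudo=0.1, seed=0):
    """Fit a motif/background mixture to all width-long windows with EM.

    Returns (motif, background, lam, windows, z), where z[i] is the posterior
    probability that windows[i] = (sequence index, offset) is a motif site.
    """
    rng = random.Random(seed)
    index = {letter: k for k, letter in enumerate(ALPHABET)}
    n_letters = len(ALPHABET)

    # Every overlapping window is treated as one sample from the mixture.
    windows = [(s, p) for s, seq in enumerate(seqs)
               for p in range(len(seq) - width + 1)]
    words = [[index[c] for c in seqs[s][p:p + width]] for s, p in windows]

    # Background component: a single letter distribution, fixed here to the
    # smoothed overall letter frequencies (a simplification for this sketch).
    counts = [pseudo] * n_letters
    for seq in seqs:
        for c in seq:
            counts[index[c]] += 1
    total = sum(counts)
    background = [c / total for c in counts]

    # Motif component: one letter distribution per column, seeded from a
    # randomly chosen window (hypothetical initialisation).
    start = rng.choice(words)
    high = 0.7
    low = (1.0 - high) / (n_letters - 1)
    motif = [[high if k == start[j] else low for k in range(n_letters)]
             for j in range(width)]
    lam = min(0.5, len(seqs) / len(words))  # mixing weight of the motif

    z = [0.0] * len(words)
    for _ in range(n_iter):
        # E-step: posterior probability that each window came from the motif.
        for i, word in enumerate(words):
            log_m = math.log(lam) + sum(
                math.log(motif[j][k]) for j, k in enumerate(word))
            log_b = math.log(1.0 - lam) + sum(
                math.log(background[k]) for k in word)
            top = max(log_m, log_b)  # subtract the max for numerical stability
            pm = math.exp(log_m - top)
            pb = math.exp(log_b - top)
            z[i] = pm / (pm + pb)

        # M-step: re-estimate motif columns and mixing weight from the
        # posterior-weighted letter counts (pseudocounts avoid zeros).
        col_counts = [[pseudo] * n_letters for _ in range(width)]
        for zi, word in zip(z, words):
            for j, k in enumerate(word):
                col_counts[j][k] += zi
        motif = [[c / sum(col) for c in col] for col in col_counts]
        lam = min(max(sum(z) / len(z), 1e-6), 1.0 - 1e-6)

    return motif, background, lam, windows, z


if __name__ == "__main__":
    demo = ["ACGTACGTTGACA", "TTGACAGGCTAAC", "GGATTGACATCGA"]
    motif, background, lam, windows, z = fit_two_component_mixture(demo, 6)
    best = max(range(len(z)), key=lambda i: z[i])
    s, p = windows[best]
    print("most probable site:", demo[s][p:p + 6], "posterior:", round(z[best], 3))
```

In this sketch, calling a window a motif occurrence when its posterior is at least 0.5 is equivalent to requiring its log-odds score (motif versus background) to be at least log((1 - lam) / lam), which is one way to read the model-plus-threshold classifier mentioned in the abstract. Because a single random start can converge to a poor local optimum, a practical run would likely try several starting windows and keep the best fit.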
