MADMX: A Strategy for Maximal Dense Motif Extraction

We develop, analyze, and experiment with a new tool, called MADMX, which extracts frequent motifs from biological sequences. We introduce the notion of density to single out the "significant" motifs. The density is a simple and flexible measure for bounding the number of don't cares in a motif, defined as the fraction of solid (i.e., different from don't care) characters in the motif. A maximal dense motif has density above a certain threshold, and specializing any of its don't care symbols or extending its boundaries decreases its number of occurrences in the input sequence. By extracting only maximal dense motifs, MADMX reduces the output size and improves performance, while enhancing the quality of the discoveries. The efficiency of our approach relies on a newly defined combining operation, dubbed fusion, which allows for the construction of maximal dense motifs in a bottom-up fashion, while avoiding the generation of nonmaximal ones. We provide experimental evidence of the efficiency of MADMX and of the quality of the motifs it returns.
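The definitions of density and maximality can be made concrete with a small brute-force sketch. It is an illustration only, not the MADMX algorithm: it does not implement fusion, and it simply tests the two conditions stated above on a given motif. The wildcard symbol `.`, the function names, and the example sequence are all assumptions introduced here.

```python
DONT_CARE = "."  # assumed symbol for a don't care position


def density(motif):
    """Fraction of solid (non don't care) characters in the motif."""
    solid = sum(1 for c in motif if c != DONT_CARE)
    return solid / len(motif)


def occurrences(motif, seq):
    """Start positions in seq where every solid character of motif matches."""
    m = len(motif)
    return [
        i for i in range(len(seq) - m + 1)
        if all(p == DONT_CARE or p == seq[i + j] for j, p in enumerate(motif))
    ]


def is_maximal(motif, seq):
    """True if no specialization or extension keeps the occurrence count.

    A don't care (inside the motif) or a position outside its boundaries
    can be turned into a solid character without losing occurrences
    exactly when all occurrences carry the same character at that offset.
    """
    occ = occurrences(motif, seq)
    if not occ:
        return False
    # Offsets (relative to an occurrence start) valid for every occurrence.
    lowest = -min(occ)
    highest = len(seq) - 1 - max(occ)
    for off in range(lowest, highest + 1):
        if 0 <= off < len(motif) and motif[off] != DONT_CARE:
            continue  # already a solid character
        if len({seq[i + off] for i in occ}) == 1:
            return False  # this position could be specialized or added
    return True


seq = "ACGTACCT"          # hypothetical input sequence
threshold = 0.6           # hypothetical density threshold
for motif in ("AC.T", "A..T", "C.T"):
    dense = density(motif) >= threshold
    print(motif, density(motif), occurrences(motif, seq),
          dense, is_maximal(motif, seq))
```

In this toy input, `AC.T` occurs at positions 0 and 4 and cannot be specialized or extended without losing an occurrence, whereas `A..T` can be specialized to `AC.T` and `C.T` can be extended on the left, so neither of those two is maximal.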
