Paper Information - An Output-Sensitive Flexible Pattern Discovery Algorithm

Given an input sequence of data, a motif is a repeating pattern, possibly interspersed with "don't care" characters; a flexible motif may have a variable (as opposed to fixed) number of "don't care" characters. Given a sequence of records with F fields each, an association rule is a common set of f fields, f ≤ F, with identical (or similar) repeating values. In either case the data could be a sequence of characters, sets of characters, or even real values. It is well known that the number of motifs or association rules, say N, could potentially be exponential in the size of the input sequence or number of records, say n. In this paper we present a new algorithm to discover all flexible motifs or association rules in the input. A novel feature of this algorithm is that its running time is linear in the size of the output (ignoring polylog factors). More precisely, the complexity of the algorithm is O((n^5 + N) log n). This is the first algorithm for motif discovery with a proven output-sensitive complexity bound. The discovery algorithm works in two phases: in the first phase it detects a linear number of core motifs in time polynomial in the input size n, and in the second phase it detects all the remaining motifs, N' of them, in O(N' log n) time. The core motifs of the first phase are also characterized as being those of "highest specificity": loosely speaking, a pattern with higher specificity has fewer "don't care" characters. Some applications (for instance, those that study the portions of the input sequence that contribute to the non-gapped regions of motifs) require only the core motifs, so for them the first phase of the algorithm suffices. The general problem, however, is of use in motif discovery tasks in gene or protein sequences, in the discovery of association rules from gene expression data, and in data mining.
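To make the notion of a flexible motif concrete, the following minimal sketch checks where a given flexible motif occurs in a character sequence. It only illustrates the definition and is not the discovery algorithm of the paper. The motif representation (a list mixing literal characters with `(min, max)` gap tuples) and the function names `motif_to_regex` and `find_occurrences` are assumptions introduced here for illustration.

```python
import re


def motif_to_regex(motif):
    """Translate a motif into a regular expression string.

    Assumed (hypothetical) representation: each element is either a
    literal character, or a (min, max) tuple giving the allowed number
    of "don't care" characters at that position.
    """
    parts = []
    for element in motif:
        if isinstance(element, str):
            parts.append(re.escape(element))
        else:
            lo, hi = element
            # A rigid motif is the special case lo == hi.
            parts.append(".{%d,%d}" % (lo, hi))
    return "".join(parts)


def find_occurrences(sequence, motif):
    """Return every start index at which the motif matches.

    The lookahead makes each match zero-width, so overlapping
    occurrences are all reported.
    """
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence)]


# 'A', then one or two don't-care characters, then 'G'.
flexible = ["A", (1, 2), "G"]
print(motif_to_regex(flexible))
print(find_occurrences("ACGATTGAG", flexible))
```

In this sketch the motif occurs at positions 0 (`ACG`, one don't-care character) and 3 (`ATTG`, two don't-care characters). The `A` at position 7 does not start an occurrence because no don't-care character separates it from the following `G`.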
