Discovering Best Variable-Length-Don't-Care Patterns

A variable-length-don't-care pattern (VLDC pattern) is an element of the set Π = (Σ ∪ {⋆})*, where Σ is an alphabet and ⋆ is a wildcard matching any string in Σ*. Given two sets of strings, we consider the problem of finding the VLDC pattern that is the most common to one set and the least common to the other. We present a practical algorithm that finds such best VLDC patterns exactly and is sped up considerably by pruning heuristics. We introduce two versions of our algorithm: one employs a pattern matching machine (PMM), whereas the other uses an index structure called the Wildcard Directed Acyclic Word Graph (WDAWG). In addition, we consider a more general problem of finding the best pair ⟨q, k⟩, where k is the window size that specifies the length of an occurrence of the VLDC pattern q matching a string w. We present three algorithms solving this problem with pruning heuristics, using dynamic programming (DP), PMMs, and WDAWGs, respectively. Although the two problems are NP-hard, we experimentally show that our algorithms run remarkably fast.
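
To make the definitions in the abstract concrete, here is a minimal Python sketch of VLDC matching. It is an illustration under stated assumptions, not the paper's algorithm: it uses a regular expression rather than a PMM or WDAWG, and it does no pruning. The ASCII character `*` stands in for the wildcard and is assumed not to belong to the alphabet. The function names, the whole-string matching semantics, the score (matches in the first set minus matches in the second), and the reading of the window size k as "some substring of length at most k is matched" are all assumptions made for this example.

```python
import re

STAR = "*"  # assumed stand-in for the wildcard; must not be an alphabet symbol


def vldc_to_regex(pattern):
    """Translate a VLDC pattern into a compiled regular expression."""
    parts = [".*" if ch == STAR else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern, text):
    """True if the whole text is obtained by replacing each wildcard
    with some (possibly empty) string. Assumed semantics."""
    return vldc_to_regex(pattern).fullmatch(text) is not None


def score(pattern, positives, negatives):
    """Hypothetical score: matches in positives minus matches in negatives."""
    pos = sum(matches(pattern, s) for s in positives)
    neg = sum(matches(pattern, s) for s in negatives)
    return pos - neg


def matches_within_window(pattern, text, k):
    """True if some substring of text of length at most k is matched.
    Brute force over all substrings; an assumed reading of the window size."""
    rx = vldc_to_regex(pattern)
    n = len(text)
    for i in range(n + 1):
        for j in range(i, min(i + k, n) + 1):
            if rx.fullmatch(text[i:j]) is not None:
                return True
    return False


if __name__ == "__main__":
    positives = ["abcab", "aab", "xaxb"]
    negatives = ["ba", "bbb"]
    print(score("*a*b*", positives, negatives))
    print(matches_within_window("a*b", "xxaybzz", 3))
```

The pattern `*a*b*` matches every string in which an `a` is followed later by a `b`, so it separates the two example sets. A search for the best pattern would enumerate candidates and keep the one with the highest score, which is where the pruning heuristics mentioned in the abstract would matter.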
