Discovering Best Variable-Length-Don't-Care Patterns

A variable-length-don't-care pattern (VLDC pattern) is an element of the set Π = (Σ ∪ {⋆})*, where Σ is an alphabet and ⋆ is a wildcard matching any string in Σ*. Given two sets of strings, we consider the problem of finding the VLDC pattern that is the most common to one set and the least common to the other. We present a practical algorithm that finds such best VLDC patterns exactly and is substantially sped up by pruning heuristics. We introduce two versions of our algorithm: one employs a pattern matching machine (PMM), whereas the other employs an index structure called the Wildcard Directed Acyclic Word Graph (WDAWG). In addition, we consider a more general problem of finding the best pair ⟨q, k⟩, where k is the window size that bounds the length of an occurrence of the VLDC pattern q matching a string w. We present three algorithms that solve this problem with pruning heuristics, using dynamic programming (DP), PMMs, and WDAWGs, respectively. Although the two problems are NP-hard, we show experimentally that our algorithms run remarkably fast.
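To make the definition concrete, the following is a minimal Python sketch of VLDC matching and of a naive search over a fixed candidate list. It is not the PMM- or WDAWG-based algorithm of the paper and it does no pruning. Several things here are assumptions for illustration only: the wildcard is written as the ASCII character `*`, a pattern is taken to match a string when it matches the whole string, the names `vldc_matches`, `count_matches` and `best_pattern` are hypothetical, and the score (matches in the first set minus matches in the second) is a simple stand-in for whatever score function is actually used.

```python
from typing import Iterable, Sequence


def vldc_matches(pattern: str, text: str, wildcard: str = "*") -> bool:
    """Return True if `pattern` matches the whole of `text`.

    Each wildcard may be replaced by any string, including the empty one.
    """
    segments = pattern.split(wildcard)
    if len(segments) == 1:
        # No wildcard at all: the pattern is a plain string.
        return pattern == text

    first, last = segments[0], segments[-1]
    # The fixed prefix and suffix must fit without overlapping.
    if len(text) < len(first) + len(last):
        return False
    if not text.startswith(first) or not text.endswith(last):
        return False

    # Place the inner constant segments greedily, left to right,
    # inside the region between the fixed prefix and suffix.
    pos = len(first)
    end = len(text) - len(last)
    for seg in segments[1:-1]:
        idx = text.find(seg, pos, end)
        if idx == -1:
            return False
        pos = idx + len(seg)
    return True


def count_matches(pattern: str, strings: Iterable[str]) -> int:
    """Number of strings in `strings` matched by `pattern`."""
    return sum(1 for s in strings if vldc_matches(pattern, s))


def best_pattern(candidates: Sequence[str],
                 positives: Sequence[str],
                 negatives: Sequence[str]) -> str:
    """Pick the candidate with the highest placeholder score.

    Assumed score: matches among positives minus matches among negatives.
    """
    return max(
        candidates,
        key=lambda q: count_matches(q, positives) - count_matches(q, negatives),
    )


if __name__ == "__main__":
    positives = ["xab", "aab", "cabc"]
    negatives = ["ba", "bba"]
    print(best_pattern(["*ab*", "*ba*", "a*b"], positives, negatives))
```

Greedy leftmost placement of the inner segments is sufficient here because choosing the earliest possible occurrence of each segment never rules out a later one. Enumerating candidates exhaustively, as this sketch does, is what the pruning heuristics are meant to avoid.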
