Efficient Algorithms for Mining Maximal Flexible Patterns in Texts and Sequences

In this paper, we study the maximal pattern discovery problem in a given sequence for the class ERP of flexible patterns, with applications to text mining. A flexible pattern, also known as an erasing regular pattern, is a sequence of constants and wildcards that match possibly empty strings, such as AB*B*ABC. We first discuss the framework of optimal pattern discovery for predictive mining and text classification, and then show its connection to maximal pattern discovery. Next, we introduce a new notion of maximality of patterns based on the position occurrences of patterns, called position-maximality. We present an efficient algorithm, PosMaxFlexMotif, that, given an input string of length n over an alphabet Σ, enumerates all maximal patterns of ERP without duplicates in O(kmn^2) time per maximal pattern using O(mn) space, where m = |P| is the size of the pattern P to be enumerated and k = O(m) is the number of variables in P. As a corollary, the position-maximal enumeration problem for flexible patterns is solvable in output-polynomial time. Finally, we apply this result to maximal pattern discovery under maximality based on document occurrence, where it serves as a sound pruning technique.
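To make the notion of a flexible pattern concrete, the following is a minimal sketch of a naive matcher for patterns such as AB*B*ABC. It is not the PosMaxFlexMotif algorithm, and it makes two assumptions that the abstract does not state: a position occurrence is taken to be a start index at which the first constant segment matches, and `*` is the only wildcard symbol. The function names `parse_flexible`, `matches_at`, and `start_positions` are hypothetical.

```python
def parse_flexible(pattern, wildcard="*"):
    """Split a flexible pattern into its constant segments.

    "AB*B*ABC" -> ["AB", "B", "ABC"]
    """
    return pattern.split(wildcard)


def matches_at(text, segments, start):
    """Return True if the pattern matches some substring of text beginning at start.

    The first segment must occur exactly at start. Each later segment is placed
    at its leftmost occurrence at or after the end of the previous one; because
    a wildcard may match any (possibly empty) string, this greedy choice never
    rules out a match that some other placement would allow.
    """
    if not text.startswith(segments[0], start):
        return False
    pos = start + len(segments[0])
    for seg in segments[1:]:
        found = text.find(seg, pos)
        if found < 0:
            return False
        pos = found + len(seg)
    return True


def start_positions(text, pattern):
    """List every start index of text at which the flexible pattern matches.

    Assumption: occurrences are identified by start index; the paper's own
    definition of position occurrence may differ.
    """
    segments = parse_flexible(pattern)
    return [i for i in range(len(text) + 1) if matches_at(text, segments, i)]


if __name__ == "__main__":
    print(start_positions("ABBABCXABABC", "AB*B*ABC"))  # [0, 3]
```

In the example string, the match at index 0 uses two empty wildcards (ABBABC), while the match at index 3 needs the first wildcard to cover "CXA" before the segment B is found.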
