A parameterizable enumeration algorithm for sequence mining

In this paper, we introduce a generic framework for mining sequences under various constraints. More precisely, we study the enumeration of all partitions of a word w into multisets of subsequences. We show that, with additional predicates, this generator can be used for frequent subsequence and frequent substring mining. We define the transition graph T_w, whose vertices are multisets of words and whose arcs are transitions between multisets. We show that T_w is a directed acyclic graph and that it admits a covering tree. We use T_w to propose a generic algorithm that enumerates, without redundancy, all multisets that satisfy a set of predicates.
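To make the enumerated objects concrete, the following is a minimal brute-force sketch in Python. It is not the transition-graph algorithm of the paper. It assumes that a partition of w assigns each position of w to exactly one block and that each block is read left to right as a subsequence. The names `set_partitions`, `subsequence_multisets`, and `predicate` are hypothetical and chosen for illustration only.

```python
from collections import Counter


def set_partitions(items):
    """Yield every partition of the list `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into a block of its own ...
        yield [[first]] + partition
        # ... or prepend it to each existing block in turn.
        # `first` is smaller than every index in `rest`, so blocks stay sorted.
        for i, block in enumerate(partition):
            yield partition[:i] + [[first] + block] + partition[i + 1:]


def subsequence_multisets(word, predicate=lambda multiset: True):
    """Yield each distinct multiset of subsequences of `word` exactly once.

    Assumption: a partition assigns every position of `word` to one block,
    and a block is read in increasing position order as a subsequence.
    `predicate` is a hypothetical filter that receives a Counter.
    """
    seen = set()
    for partition in set_partitions(list(range(len(word)))):
        multiset = Counter("".join(word[i] for i in block) for block in partition)
        key = tuple(sorted(multiset.items()))
        if key in seen:
            continue  # same multiset reached through another partition
        seen.add(key)
        if predicate(multiset):
            yield multiset


if __name__ == "__main__":
    for m in subsequence_multisets("aab"):
        print(dict(m))
```

This sketch removes duplicates by remembering every multiset it has already produced, which can take exponential space. The abstract instead states that a covering tree of T_w is used to avoid redundancy.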
