Mining Minimal Distinguishing Subsequence Patterns with Gap Constraints

Discovering contrasts between collections of data is an important task in data mining. In this paper, we introduce a new type of contrast pattern, called a minimal distinguishing subsequence (MDS). An MDS is a minimal subsequence that occurs frequently in one class of sequences and infrequently in sequences of another class. It is a natural way of representing strong and succinct contrast information between two sequential datasets and can be useful in applications such as protein comparison, document comparison and building sequential classification models. Mining MDS patterns is a challenging task and is significantly different from mining contrasts between relational/transactional data. One particularly important type of constraint that can be integrated into the mining process is the maximum gap constraint. We present an efficient algorithm called ConSGapMiner, to mine all MDSs according to a maximum gap constraint. It employs highly efficient bitset and Boolean operations, for powerful gap based pruning within a prefix growth framework. A performance evaluation with both sparse and dense datasets, demonstrates the scalability of ConSGapMiner and shows its ability to mine patterns from high dimensional datasets at low supports.

[1]  Luc De Raedt,et al.  Towards Optimizing Conjunctive Inductive Queries , 2004, KDID.

[2]  Johannes Gehrke,et al.  Sequential PAttern mining using a bitmap representation , 2002, KDD.

[3]  Stephen D. Bay,et al.  Detecting Group Differences: Mining Contrast Sets , 2001, Data Mining and Knowledge Discovery.

[4]  Zhou Wen,et al.  Efficient mining of emerging patterns , 2002 .

[5]  Ayumi Shinohara,et al.  A practical algorithm to find the best subsequence patterns , 2000, Theor. Comput. Sci..

[6]  Mohammed J. Zaki Sequence mining in categorical domains: incorporating constraints , 2000, CIKM '00.

[7]  Jiawei Han,et al.  TSP: Mining top-k closed sequential patterns , 2004, Knowledge and Information Systems.

[8]  Jinyan Li,et al.  Mining border descriptions of emerging patterns from dataset pairs , 2005, Knowledge and Information Systems.

[9]  James Bailey,et al.  Classification Using Constrained Emerging Patterns , 2003, WAIM.

[10]  Jiawei Han,et al.  BIDE: efficient mining of frequent closed sequences , 2004, Proceedings. 20th International Conference on Data Engineering.

[11]  Cláudia Antunes,et al.  Generalization of Pattern-Growth Methods for Sequential Pattern Mining with Gap Constraints , 2003, MLDM.

[12]  Dimitrios Gunopulos,et al.  Episode Matching , 1997, CPM.

[13]  Gemma C. Garriga Discovering Unbounded Episodes in Sequential Data , 2003, PKDD.

[14]  Umeshwar Dayal,et al.  FreeSpan: frequent pattern-projected sequential pattern mining , 2000, KDD '00.

[15]  Ramakrishnan Srikant,et al.  Mining sequential patterns , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[16]  Qiming Chen,et al.  PrefixSpan,: mining sequential patterns efficiently by prefix-projected pattern growth , 2001, Proceedings 17th International Conference on Data Engineering.

[17]  Ke Wang,et al.  Frequent-subsequence-based prediction of outer membrane proteins , 2003, KDD '03.

[18]  Giri Narasimhan,et al.  Mining Protein Sequences for Motifs , 2002, J. Comput. Biol..

[19]  Xifeng Yan,et al.  CloSpan: Mining Closed Sequential Patterns in Large Datasets , 2003, SDM.

[20]  Jinyan Li,et al.  CAEP: Classification by Aggregating Emerging Patterns , 1999, Discovery Science.

[21]  Christophe Rigotti,et al.  Constraint-Based Mining of Episode Rules and Optimal Window Sizes , 2004, PKDD.

[22]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[23]  Tom M. Mitchell,et al.  Generalization as Search , 2002 .

[24]  Jianyong Wang,et al.  Mining sequential patterns by pattern-growth: the PrefixSpan approach , 2004, IEEE Transactions on Knowledge and Data Engineering.

[25]  Kotagiri Ramamohanarao,et al.  Making Use of the Most Expressive Jumping Emerging Patterns for Classification , 2001, Knowledge and Information Systems.

[26]  Chi Lap Yip,et al.  Mining emerging substrings , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings..

[27]  Heikki Mannila,et al.  Discovering Frequent Episodes in Sequences , 1995, KDD.

[28]  Luc De Raedt,et al.  The Levelwise Version Space Algorithm and its Application to Molecular Fragment Finding , 2001, IJCAI.

[29]  David Maier,et al.  The Complexity of Some Problems on Subsequences and Supersequences , 1978, JACM.

[30]  Gemma Casas-Garriga Discovering Unbounded Episodes in Sequential Data , 2003 .

[31]  Geoffrey I. Webb,et al.  On detecting differences between groups , 2003, KDD '03.

[32]  Mohammed J. Zaki,et al.  SPADE: An Efficient Algorithm for Mining Frequent Sequences , 2004, Machine Learning.

[33]  Mohammed J. Zaki,et al.  Scalable Feature Mining for Sequential Data , 2000, IEEE Intell. Syst..

[34]  Jinyan Li,et al.  Efficient mining of emerging patterns: discovering trends and differences , 1999, KDD '99.

[35]  Umeshwar Dayal,et al.  PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth , 2001, ICDE 2001.