Taxonomy-driven lumping for sequence mining

Given a taxonomy of events and a dataset of sequences of these events, we study the problem of finding efficient and effective ways to produce a compact representation of the sequences. We model sequences with Markov models whose states correspond to nodes in the provided taxonomy; each state represents the events in the subtree under the corresponding node. By lumping observed events into states that correspond to internal nodes in the taxonomy, we obtain more compact models that are easier to understand and visualize, at the expense of a decrease in the data likelihood. We formally define and characterize our problem, and we propose a scalable search method for finding a good trade-off between two conflicting goals: maximizing the data likelihood and minimizing the model complexity. We implement these ideas in Taxomo, a taxonomy-driven modeler, which we apply in two different domains: query-log mining and the mining of moving-object trajectories. The empirical evaluation confirms the feasibility and usefulness of our approach.
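
The trade-off between data likelihood and model complexity can be illustrated with a minimal sketch. It is not the Taxomo implementation. The toy taxonomy, the sequences, and all function names are hypothetical, and the sketch makes three assumptions: a first-order Markov chain over lumped states, maximum-likelihood emission probabilities of events within each state, and a BIC-style penalty as one possible way to score complexity.

```python
import math
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); "drink" is the root.
PARENT = {"espresso": "coffee", "latte": "coffee",
          "green": "tea", "black": "tea",
          "coffee": "drink", "tea": "drink"}
LEAVES = ["espresso", "latte", "green", "black"]

# Hypothetical event sequences over the leaves of the taxonomy.
SEQS = [["espresso", "latte", "green"],
        ["latte", "espresso", "black"],
        ["green", "black", "espresso"]]


def lump(event, cut, parent):
    """Map an event to its ancestor-or-self that belongs to the cut."""
    node = event
    while node not in cut:
        node = parent[node]  # raises KeyError if the cut does not cover the event
    return node


def log_likelihood(sequences, cut, parent):
    """Log-likelihood under ML estimates of transitions and emissions.

    Assumption: the probability of the first state of each sequence is ignored.
    """
    trans, out, emit, occ = Counter(), Counter(), Counter(), Counter()
    for seq in sequences:
        states = [lump(e, cut, parent) for e in seq]
        for e, s in zip(seq, states):
            emit[(s, e)] += 1
            occ[s] += 1
        for a, b in zip(states, states[1:]):
            trans[(a, b)] += 1
            out[a] += 1
    ll = sum(n * math.log(n / out[a]) for (a, _), n in trans.items())
    ll += sum(n * math.log(n / occ[s]) for (s, _), n in emit.items())
    return ll


def n_parameters(cut, n_leaves):
    """Free parameters: k*(k-1) transitions plus (n_leaves - k) emissions."""
    k = len(cut)
    return k * (k - 1) + (n_leaves - k)


def bic(sequences, cut, parent, n_leaves):
    """BIC-style score (lower is better); the criterion is an assumption."""
    n_obs = sum(len(s) for s in sequences)
    return (-2.0 * log_likelihood(sequences, cut, parent)
            + n_parameters(cut, n_leaves) * math.log(n_obs))


if __name__ == "__main__":
    for cut in (set(LEAVES), {"coffee", "tea"}, {"drink"}):
        print(sorted(cut),
              round(log_likelihood(SEQS, cut, PARENT), 3),
              round(bic(SEQS, cut, PARENT, len(LEAVES)), 3))
```

In this sketch, moving the cut from the leaves toward the root never increases the log-likelihood but reduces the parameter count, which is the tension a search over cuts has to balance.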
