Paper information - Mining categorical sequences from data using a hybrid clustering method


The identification of different dynamics in sequential data has become an everyday need in scientific fields such as marketing, bioinformatics, finance, and the social sciences. Compared with cross-sectional or static data, this type of observation (also known as stream data, temporal data, longitudinal data, or repeated measures) is more challenging because the clustering process has to account for dependency in the data. In this research we focus on clustering categorical sequences. The proposed method combines model-based and heuristic clustering. In the first step, an extension of the hidden Markov model transforms the categorical sequences into a probabilistic space in which a symmetric Kullback–Leibler distance can operate. In the second step, hierarchical clustering is applied to the resulting matrix of distances to cluster the sequences. The paper illustrates the potential of this type of hybrid approach using a synthetic data set, the well-known Microsoft data set of website users' search patterns, and a survey on job career dynamics.
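The two-step idea can be illustrated with a minimal sketch. It is not the authors' implementation: as a simplifying assumption, each sequence is represented by smoothed first-order Markov transition probabilities rather than by the paper's hidden Markov model extension. The helper names (`transition_probs`, `symmetric_kl`, `average_linkage`), the smoothing constant `alpha`, the unweighted sum over transition rows, the choice of average linkage, and the toy sequences are all hypothetical choices made for illustration.

```python
import math
from itertools import combinations


def transition_probs(seq, states, alpha=1.0):
    """Step 1 (stand-in): map a sequence to first-order transition probabilities.

    Additive smoothing (alpha) keeps every probability strictly positive,
    so the KL terms below are always defined.
    """
    counts = {a: {b: alpha for b in states} for a in states}
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1.0
    probs = {}
    for a in states:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in states}
    return probs


def symmetric_kl(p, q):
    """Symmetric KL distance, KL(p||q) + KL(q||p), summed over transition rows."""
    d = 0.0
    for a in p:
        for b in p[a]:
            # (p - q) * log(p / q) is the sum of the two directed KL terms.
            d += (p[a][b] - q[a][b]) * math.log(p[a][b] / q[a][b])
    return d


def average_linkage(dist, n, k):
    """Step 2: agglomerative clustering with average linkage until k clusters remain.

    dist maps frozenset({i, j}) to the distance between sequences i and j.
    """
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        total = sum(dist[frozenset((i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the closest pair of clusters (x < y by construction) and merge them.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda xy: linkage(clusters[xy[0]], clusters[xy[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Toy data: two "persistent" sequences and two "alternating" sequences.
sequences = ["AAAAABAAAA", "AAABAAAAAA", "ABABABABAB", "BABABABABA"]
states = "AB"

models = [transition_probs(s, states) for s in sequences]
dist = {
    frozenset((i, j)): symmetric_kl(models[i], models[j])
    for i, j in combinations(range(len(models)), 2)
}
clusters = average_linkage(dist, len(models), k=2)
print(clusters)
```

On these toy sequences the persistent pair and the alternating pair end up in separate clusters. In practice the number of clusters would be chosen by inspecting the dendrogram or a model selection criterion rather than fixed in advance, as `k=2` is here.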

José G. Dias | Luca De Angelis
