Mining categorical sequences from data using a hybrid clustering method

The identification of different dynamics in sequential data has become an everyday need in scientific fields such as marketing, bioinformatics, finance, and the social sciences. Unlike cross-sectional or static data, this type of observation (also known as stream data, temporal data, longitudinal data, or repeated measures) is more challenging because the dependency between observations has to be incorporated into the clustering process. In this research we focus on clustering categorical sequences. The proposed method combines model-based and heuristic clustering. In the first step, an extension of the hidden Markov model maps the categorical sequences into a probabilistic space in which a symmetric Kullback–Leibler distance can be computed. In the second step, the sequences are clustered by applying hierarchical clustering to the resulting distance matrix. This paper illustrates the potential of this type of hybrid approach using three applications: a synthetic data set, the well-known Microsoft dataset of website users' search patterns, and a survey on job career dynamics.
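The two-step idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the method of the paper: a smoothed first-order Markov transition matrix stands in for the extended hidden Markov model, the symmetric Kullback–Leibler distance is taken as the sum of the two directed divergences averaged over matrix rows, and average linkage is one arbitrary choice of hierarchical clustering. The function names, the smoothing constant `alpha`, and the toy sequences are all hypothetical.

```python
import math
from itertools import combinations


def transition_matrix(seq, states, alpha=1.0):
    """Step 1 (simplified): map a sequence to a row-stochastic transition matrix.

    Add-alpha smoothing keeps every probability strictly positive, so the
    logarithms in the divergence below are always defined.
    """
    index = {s: i for i, s in enumerate(states)}
    k = len(states)
    counts = [[alpha] * k for _ in range(k)]
    for a, b in zip(seq, seq[1:]):
        counts[index[a]][index[b]] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def symmetric_kl(p, q):
    """Sum of KL(p||q) and KL(q||p) per row, averaged over the rows."""
    total = 0.0
    for row_p, row_q in zip(p, q):
        for a, b in zip(row_p, row_q):
            # (a - b) * log(a / b) equals a*log(a/b) + b*log(b/a)
            total += (a - b) * math.log(a / b)
    return total / len(p)


def average_linkage(dist, n_clusters):
    """Step 2: merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        def linkage(pair):
            a, b = pair
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            return sum(dist[i][j] for i, j in pairs) / len(pairs)

        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


# Toy data (hypothetical): two "persistent" and two "cycling" sequences.
states = ["A", "B", "C"]
sequences = [
    "AAAABBBBCCCCAAAA",
    "AAABBBBBCCCAAAAB",
    "ABCABCABCABCABCA",
    "ABCABCBCABCABCAB",
]

models = [transition_matrix(s, states) for s in sequences]
dist = [[symmetric_kl(p, q) for q in models] for p in models]
clusters = average_linkage(dist, n_clusters=2)
print(clusters)
```

On this toy input the two persistent sequences fall into one cluster and the two cycling sequences into the other, because the distance is driven by the transition dynamics rather than by the raw symbols.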
