A general probabilistic framework for clustering individuals and objects

This paper presents a unifying probabilistic framework for clustering individuals or systems into groups when the available data measurements are not multivariate vectors of fixed dimensionality. For example, one might have data from a set of medical patients, where for each patient one has a set of observed time-series, each of potentially different length and different sampling rate. We propose a general model-based probabilistic framework for clustering data of this form, which are non-vector in nature and may vary in size from individual to individual. We discuss the Expectation-Maximization (EM) procedure for clustering within this framework and show how it can be applied in a general manner to clustering of sequences, time-series, trajectories, and other non-vector data. We show that a number of earlier algorithms can be viewed as special cases within this unifying framework. The paper concludes with several illustrations of the method, including clustering of red blood cell data in a medical diagnosis context, clustering of proteins from curves of gene expression data, and clustering of individuals based on their sequences of Web navigation.

General Terms: Clustering, Mixture Models, EM Algorithm
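To make the idea of EM clustering over variable-length, non-vector data concrete, the following is a minimal sketch, not the paper's implementation. It assumes one particular instance of the framework: a finite mixture of first-order Markov chains over discrete sequences, loosely in the spirit of the Web navigation illustration. The function name `em_markov_mixture`, the smoothing pseudo-count `alpha`, and the toy sequences are all hypothetical choices made for illustration.

```python
import math
import random


def em_markov_mixture(seqs, n_states, n_clusters, n_iter=50, seed=0, alpha=0.01):
    """EM sketch for a mixture of first-order Markov chains.

    seqs: list of non-empty lists of integer states in [0, n_states);
          sequences may have different lengths.
    alpha: small pseudo-count (an assumption of this sketch) that keeps
           every probability strictly positive.
    """
    rng = random.Random(seed)
    n = len(seqs)

    # Random soft initial memberships: resp[i][k] ~ P(cluster k | sequence i).
    resp = []
    for _ in seqs:
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iter):
        # M-step: membership-weighted counts, then normalize.
        weights = [
            (sum(r[k] for r in resp) + alpha) / (n + n_clusters * alpha)
            for k in range(n_clusters)
        ]
        init = [[alpha] * n_states for _ in range(n_clusters)]
        trans = [[[alpha] * n_states for _ in range(n_states)]
                 for _ in range(n_clusters)]
        for seq, r in zip(seqs, resp):
            for k in range(n_clusters):
                init[k][seq[0]] += r[k]
                for a, b in zip(seq, seq[1:]):
                    trans[k][a][b] += r[k]
        for k in range(n_clusters):
            s = sum(init[k])
            init[k] = [v / s for v in init[k]]
            for a in range(n_states):
                s = sum(trans[k][a])
                trans[k][a] = [v / s for v in trans[k][a]]

        # E-step: posterior membership of each whole sequence, in log space.
        for i, seq in enumerate(seqs):
            logp = []
            for k in range(n_clusters):
                lp = math.log(weights[k]) + math.log(init[k][seq[0]])
                lp += sum(math.log(trans[k][a][b]) for a, b in zip(seq, seq[1:]))
                logp.append(lp)
            m = max(logp)
            unnorm = [math.exp(lp - m) for lp in logp]
            z = sum(unnorm)
            resp[i] = [u / z for u in unnorm]

    return weights, init, trans, resp


# Toy data (hypothetical): alternating sequences and constant sequences,
# deliberately of different lengths.
seqs = [
    [0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
weights, init, trans, resp = em_markov_mixture(seqs, n_states=2, n_clusters=2)
```

Each sequence contributes a single likelihood term regardless of its length, which is why individuals with differently sized data can be clustered together. In this sketch, other per-individual models (for example regression curves for trajectories) would only change how the per-sequence log-likelihood and the weighted M-step are computed.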
