Scalable Probabilistic Clustering

The Expectation-Maximization (EM) algorithm is a popular approach to probabilistic database clustering. A database of observations is clustered by identifying k sub-populations and summarizing each sub-population with a model, or probability density function. EM iteratively estimates the membership of each observation in each cluster and the parameters of the k density functions, one per cluster. Typical EM implementations require a full database scan at each iteration, and the number of iterations needed to converge is not known in advance. For large databases, these scans become prohibitively expensive. We present a scalable implementation of the EM algorithm based upon identifying regions of the data that are compressible and regions that must be maintained in memory. The approach operates within the confines of a limited main-memory buffer. Data resolution is preserved to the extent possible given the size of the memory buffer and the fit of the current clustering model to the data. We extend the framework to update multiple cluster models simultaneously. Computational tests indicate that this scalable scheme outperforms sampling-based and incremental approaches, the straightforward alternatives for scaling existing in-memory implementations to large databases.
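To make the baseline concrete, the following is a minimal sketch of a conventional in-memory EM loop, not the scalable scheme presented here. It assumes one-dimensional data and Gaussian cluster densities; the function names (`gaussian_pdf`, `em_gmm_1d`), the quantile-based initialization, and the variance floor `min_var` are illustrative choices rather than details taken from the paper. Note that both the E-step and the M-step touch every observation, which is the per-iteration full scan that becomes expensive on large databases.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, k, iterations=50, min_var=1e-6):
    """Fit a k-component 1-D Gaussian mixture with plain EM.

    Returns (weights, means, variances). Initialization and the variance
    floor are assumptions made for this sketch.
    """
    n = len(data)
    ordered = sorted(data)
    # Assumed initialization: means at evenly spaced quantiles of the data.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in data) / n, min_var)
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: one full pass over the data to estimate the membership
        # (responsibility) of each observation in each cluster.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            if total == 0.0:
                resp.append([1.0 / k] * k)  # numerical underflow fallback
            else:
                resp.append([p / total for p in joint])

        # M-step: re-estimate each cluster's parameters from the memberships.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty cluster unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                min_var,
            )
    return weights, means, variances


# Usage: two synthetic sub-populations centered near 0 and 10.
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
sample += [rng.gauss(10.0, 1.0) for _ in range(200)]
print(em_gmm_1d(sample, k=2))
```

The sketch holds all observations in a Python list, so its memory use grows with the database. A bounded-buffer variant would have to replace the per-observation loops with updates over a mix of retained points and compressed summaries.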
