Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure that computes a locally optimal clustering solution maximizing the fit to the data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, these scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that are compressible and regions that must be maintained in memory. The approach operates within the confines of a limited main-memory buffer and requires at most a single database scan. Data resolution is preserved to the extent possible, based upon the size of the main-memory buffer and the fit of the current clustering model to the data. We extend the method to efficiently update multiple models simultaneously. Computational tests indicate that this scalable scheme outperforms sampling-based approaches, the straightforward alternatives to “scaling” traditional in-memory implementations to large databases.

1 Preliminaries and Motivation

Data clustering is important in many fields, including data mining [FPSU96], statistical data analysis [KR89, BR93], compression [ZRL97], and vector quantization [DH73]. Applications include data analysis and modeling [FDW97, FHS96], image segmentation, marketing, fraud detection, predictive modeling, data summarization, general data reporting tasks, data cleaning, and exploratory data analysis [B*96]. Clustering is a crucial data mining step, and performing this task over large databases is essential.

A general view of clustering places it in the framework of density estimation [S86, S92, A73]: clustering can be viewed as identifying the dense regions of the data source. An efficient representation of the probability density function is the mixture model, which asserts that the data is a combination of k individual component densities corresponding to the k clusters. Basically, the problem is this: given data records (observations), identify a set of k populations in the data and provide a model (density distribution) of each of the populations. Since the model assumes a mixture of populations, it is often referred to as a mixture model. The Expectation-Maximization (EM) algorithm [DLR77, CS96] is an effective and popular technique for estimating the mixture model parameters, i.e. fitting the model to the database. The EM algorithm iteratively refines an initial cluster model to better fit the data and terminates at a solution that is locally optimal or a saddle point of the underlying clustering criterion [DLR77, B95]. The objective function is the log-likelihood of the data given the model, which measures how well the probabilistic model fits the data. Other similar iterative refinement clustering methods include the popular K-Means-type algorithms [M67, DH73, F90, BMS97, SI84].
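To make the iteration concrete, the following is a minimal in-memory sketch of EM for a mixture of k Gaussians with diagonal covariances, written in Python with NumPy. It is not the scalable method developed in this paper; the function name, the fixed iteration count, and the initialization scheme are illustrative assumptions.

import numpy as np

def em_gmm(X, k, n_iter=50, seed=0):
    """Fit a k-component Gaussian mixture (diagonal covariances) by EM.

    X : (n, d) array of data records. Returns mixture weights, means,
    variances, and the log-likelihood trace (the objective being maximized).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initial model: uniform weights, means at k random records, global variance.
    weights = np.full(k, 1.0 / k)
    means = X[rng.choice(n, size=k, replace=False)]
    variances = np.tile(X.var(axis=0) + 1e-6, (k, 1))
    log_lik_trace = []

    for _ in range(n_iter):
        # E-step: posterior probability (responsibility) of each cluster
        # for each record under the current model.
        log_p = np.empty((n, k))
        for j in range(k):
            diff = X - means[j]
            log_p[:, j] = (np.log(weights[j])
                           - 0.5 * np.sum(np.log(2 * np.pi * variances[j]))
                           - 0.5 * np.sum(diff ** 2 / variances[j], axis=1))
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        resp = np.exp(log_p - log_norm[:, None])
        log_lik_trace.append(log_norm.sum())  # log-likelihood of the data

        # M-step: re-estimate parameters from the weighted (soft) assignments.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = np.stack([
            (resp[:, j, None] * (X - means[j]) ** 2).sum(axis=0) / nk[j] + 1e-6
            for j in range(k)
        ])
    return weights, means, variances, log_lik_trace

Each iteration touches every record in the E-step, and the log-likelihood trace is non-decreasing until it stalls at a local optimum or saddle point; this repeated full-data access is exactly what makes naive EM expensive on large databases.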
While these approaches have received attention in the database and data mining literature [NH94, ZRL97, BFR98], they are limited in their ability to compute correct statistical models of the data. The K-Means algorithm minimizes the sum of squared Euclidean distances between the data records in a cluster and the cluster's mean vector. This assignment criterion implicitly assumes that clusters are represented by spherical Gaussian distributions located at the k cluster means [BB95, B95]. Since the K-Means algorithm utilizes the Euclidean metric, it does not generalize to the problem of clustering discrete or categorical data. The K-Means algorithm also uses a membership function which assigns each data record to exactly one cluster. This hard assignment does not allow for uncertainty in the membership of a data record in a cluster.

The mixture model framework relaxes these assumptions. Due to the probabilistic nature of the mixture model, arbitrarily shaped clusters (i.e. non-spherical, etc.) can be effectively represented by the choice of suitable component density functions (e.g. Poisson, non-spherical Gaussians, etc.). Categorical or discrete data is similarly handled by associating a discrete distribution (e.g. multinomial, binomial, etc.) with these attributes.

Consider a simple example with data consisting of 2 attributes: age and income. One may choose to model the data as a single cluster and report that the average age over the data records is 41 years and the average income is $26K/year (with associated variances). However, this may be rather deceptive and uninformative. The data may be a mixture of working people, retired people, and children. A more informative summary might identify these subsets or clusters and report the cluster parameters. Such results are shown in Table 1.1:

Table 1.1: Sample data summary by segment

  "name" (not given)   Size   Average Age   Average Income
  "working"            45%    38            $45K
  "retired"            30%    72            $20K
  "children"           20%    12            $0
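To illustrate both points (soft membership and mixed attribute types), the following Python sketch computes cluster-membership probabilities for a single record under a hypothetical 3-component mixture in which age is modeled by a Gaussian within each cluster and an employment-status attribute by a per-cluster multinomial. All parameter values, attribute names, and the within-component independence assumption are illustrative assumptions, not figures from the paper.

import numpy as np

# Hypothetical 3-component model over one continuous attribute (age) and one
# categorical attribute (employment status). All numbers are made up for the
# illustration; they are not the Table 1.1 figures.
weights = np.array([0.45, 0.30, 0.25])          # mixture weights (cluster priors)
age_means = np.array([38.0, 72.0, 12.0])        # per-cluster Gaussian means
age_vars = np.array([60.0, 40.0, 15.0])         # per-cluster Gaussian variances
status_levels = ["employed", "retired", "student"]
# Per-cluster multinomial over the categorical attribute (each row sums to 1).
status_probs = np.array([[0.90, 0.05, 0.05],
                         [0.05, 0.90, 0.05],
                         [0.05, 0.05, 0.90]])

def membership(age, status):
    """Return P(cluster | record), the soft membership of one data record."""
    j = status_levels.index(status)
    # Within a component, the continuous and categorical attributes are
    # treated as independent, so their likelihoods multiply.
    gauss = (np.exp(-0.5 * (age - age_means) ** 2 / age_vars)
             / np.sqrt(2.0 * np.pi * age_vars))
    unnorm = weights * gauss * status_probs[:, j]
    return unnorm / unnorm.sum()

print(membership(40, "employed"))   # dominated by the "working"-like cluster
print(membership(68, "retired"))    # mostly the "retired"-like cluster

Unlike a K-Means assignment, the result is a probability vector over the k clusters, so a record near a cluster boundary contributes fractionally to several clusters rather than being forced into exactly one.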
References

[1] J. MacQueen. Some methods for classification and analysis of multivariate observations. 1967.
[2] H. L. Le Roy, et al. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. IV. 1969.
[3] Michael R. Anderberg, et al. Cluster Analysis for Applications. 1973.
[4] Richard O. Duda, et al. Pattern classification and scene analysis. 1974. A Wiley-Interscience publication.
[5] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion). 1977.
[6] Shokri Z. Selim, et al. K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality. 1984. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[7] P. J. Green, et al. Density Estimation for Statistics and Data Analysis. 1987.
[8] John E. Moody, et al. Towards Faster Stochastic Gradient Search. 1991. NIPS.
[9] A. Raftery, et al. Model-based Gaussian and non-Gaussian clustering. 1993.
[10] Heekuck Oh, et al. Neural Networks for Pattern Recognition. 1993. Adv. Comput.
[11] Yoshua Bengio, et al. Convergence Properties of the K-Means Algorithms. 1994. NIPS.
[12] Jiawei Han, et al. Efficient and Effective Clustering Methods for Spatial Data Mining. 1994. VLDB.
[13] Daryl Pregibon, et al. A Statistical Perspective on Knowledge Discovery in Databases. 1996. Advances in Knowledge Discovery and Data Mining.
[14] Paul S. Bradley, et al. Clustering via Concave Minimization. 1996. NIPS.
[15] Peter C. Cheeseman, et al. Bayesian Classification (AutoClass): Theory and Results. 1996. Advances in Knowledge Discovery and Data Mining.
[16] Miron Livny, et al. Fast Density and Probability Estimation Using CF-Kernel Method for Very Large Databases. 1996.
[17] J. Simonoff. Multivariate Density Estimation. 1996.
[18] U. M. Fayyad. Data mining and knowledge discovery: making sense out of data. 1996.
[19] Padhraic Smyth, et al. Clustering Using Monte Carlo Cross-Validation. 1996. KDD.
[21] Paul S. Bradley, et al. Scaling Clustering Algorithms to Large Databases. 1998. KDD.
[22] Paul S. Bradley, et al. Refining Initial Points for K-Means Clustering. 1998. ICML.
[23] Michael I. Jordan. Learning in Graphical Models. 1999. NATO ASI Series.
[24] Dimitrios Gunopulos, et al. Automatic subspace clustering of high dimensional data for data mining applications. 1998. SIGMOD '98.
[25] Sudipto Guha, et al. CURE: an efficient clustering algorithm for large databases. 1998. SIGMOD '98.
[26] Geoffrey E. Hinton, et al. A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants. 1998. Learning in Graphical Models.
[27] Paul S. Bradley, et al. Compressed data cubes for OLAP aggregate query approximation on continuous dimensions. 1999. KDD '99.
[28] Kristin P. Bennett, et al. Density-based indexing for approximate nearest-neighbor queries. 1999. KDD '99.
[29] Tian Zhang, et al. Fast density estimation using CF-kernel for very large databases. 1999. KDD '99.
[30] Jon M. Kleinberg, et al. Clustering categorical data: an approach based on dynamical systems. 2000. The VLDB Journal.