SQLEM: fast clustering in SQL using the EM algorithm

Clustering is one of the most important tasks performed in Data Mining applications. This paper presents an efficient SQL implementation of the EM algorithm to perform clustering in very large databases. Our version can effectively handle high dimensional data, a high number of clusters and more importantly, a very large number of data records. We present three strategies to implement EM in SQL: horizontal, vertical and a hybrid one. We expect this work to be useful for data mining programmers and users who want to cluster large data sets inside a relational DBMS.

[1]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[2]  Anil K. Jain,et al.  Clustering Methodologies in Exploratory Data Analysis , 1980, Adv. Comput..

[3]  Robert A. Jacobs,et al.  Hierarchical Mixtures of Experts and the EM Algorithm , 1993, Neural Computation.

[4]  Theodore Johnson,et al.  Squashing flat files flatter , 1999, KDD '99.

[5]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[6]  Paul S. Bradley,et al.  Scaling Clustering Algorithms to Large Databases , 1998, KDD.

[7]  Serge J. Belongie,et al.  Region-based image querying , 1997, 1997 Proceedings IEEE Workshop on Content-Based Access of Image and Video Libraries.

[8]  Carlos Ordonez,et al.  Discovering association rules based on image content , 1999, Proceedings IEEE Forum on Research and Technology Advances in Digital Libraries.

[9]  Michael L. Heytens,et al.  NonStop SQL/MX primitives for knowledge discovery , 1999, KDD '99.

[10]  Fionn Murtagh,et al.  A Survey of Recent Advances in Hierarchical Clustering Algorithms , 1983, Comput. J..

[11]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[12]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[13]  Zoubin Ghahramani,et al.  A Unifying Review of Linear Gaussian Models , 1999, Neural Computation.

[14]  Philip S. Yu,et al.  Fast algorithms for projected clustering , 1999, SIGMOD '99.

[15]  Daniel A. Keim,et al.  Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering , 1999, VLDB.