Clustering cubes with binary dimensions in one pass

Finding aggregations of high-dimensional records in large data warehouses is a crucial and costly task. Such groups of similar records are normally the partitions produced by GROUP BY queries. In this research, we relax the GROUP BY clause and recast the problem as efficient binary clustering of a fact table. We present an efficient window-based variant of the Incremental K-Means algorithm, implemented as a user-defined function in a relational database system. The speedup is achieved through the computation of sufficient statistics, multithreading, efficient distance computation, and sparse matrix operations. Finally, we compare the performance of our algorithm against multiple variants of the K-Means algorithm. Our experiments show that our incremental K-Means algorithm achieves similar or even better results in less time than the traditional K-Means algorithm.
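To illustrate how sufficient statistics and sparse distance computation can combine in a single pass over binary data, here is a minimal Python sketch. It is not the UDF implementation described above. The function name `one_pass_binary_kmeans`, the seeding rule (the first k rows become the initial centroids), and the refresh policy (centroids are recomputed after every `window` rows) are assumptions made for illustration. Each row is given as the list of dimensions set to 1, so the squared Euclidean distance to a centroid c reduces to ||c||^2 plus the sum of (1 - 2 c_l) over the non-zero dimensions l only.

```python
from typing import Iterable, List, Sequence, Tuple


def one_pass_binary_kmeans(
    rows: Iterable[Sequence[int]], d: int, k: int, window: int
) -> Tuple[List[List[float]], List[int], List[int]]:
    """Cluster sparse binary rows in one pass (illustrative sketch only).

    rows   : each row lists the indices of the dimensions equal to 1
    d      : number of binary dimensions
    k      : number of clusters
    window : centroids are refreshed after every `window` rows (assumption)
    """
    # Sufficient statistics per cluster: row count N[j] and per-dimension sum L[j].
    N = [0] * k
    L = [[0.0] * d for _ in range(k)]
    C = [[0.0] * d for _ in range(k)]  # centroids, derived as L[j] / N[j]
    norms = [0.0] * k                  # cached ||C[j]||^2 (distance to the zero row)
    labels: List[int] = []

    def refresh() -> None:
        # Recompute centroids from the sufficient statistics alone.
        for j in range(k):
            if N[j] > 0:
                C[j] = [s / N[j] for s in L[j]]
                norms[j] = sum(c * c for c in C[j])

    seen = 0
    for row in rows:
        if seen < k:
            j = seen  # assumed seeding: the first k rows start the k clusters
        else:
            # Sparse distance: only the non-zero dimensions of the row are visited.
            j = min(
                range(k),
                key=lambda c: norms[c] + sum(1.0 - 2.0 * C[c][l] for l in row),
            )
        N[j] += 1
        for l in row:
            L[j][l] += 1.0
        labels.append(j)
        seen += 1
        if seen <= k or seen % window == 0:
            refresh()

    refresh()  # make sure the final partial window is reflected
    return C, N, labels
```

Because the centroids are derived only from N and L, the input rows never need to be revisited, which is what allows a single scan of the table. A production version would also need a better seeding strategy and handling for clusters that stay empty.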
