Rk-means: Fast Clustering for Relational Data

Conventional machine learning algorithms cannot be applied until a data matrix is available to process. When the data matrix must be obtained from a relational database via a feature extraction query, the computation cost can be prohibitive, as the data matrix may be (much) larger than the total input relation size. This paper introduces Rk-means, a relational k-means algorithm for clustering relational data tuples without having to access the full data matrix, thereby avoiding both running the expensive feature extraction query and storing its output. Our algorithm leverages the underlying structure in relational data: it constructs a small grid coreset of the data matrix, on which the clusters are then built. This yields a constant-factor approximation to the k-means objective while asymptotically improving on the standard approach of first running the database query and then clustering its output. Empirical results show orders-of-magnitude speedups; Rk-means can cluster the relational data faster than the data matrix alone can be computed.
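
To make the grid-coreset idea concrete, here is a minimal in-memory sketch in Python, assuming numpy and scikit-learn. The function name grid_coreset_kmeans and the parameter k_per_dim are illustrative, not taken from the paper, and the data matrix X is materialized here purely for exposition; in Rk-means itself the per-dimension steps are answered by aggregate queries over the database, so the full matrix is never built.

from itertools import product

import numpy as np
from sklearn.cluster import KMeans


def grid_coreset_kmeans(X, k, k_per_dim=5, seed=0):
    n, d = X.shape

    # Step 1: cluster each dimension independently with 1-D k-means. Over a
    # database, each marginal distribution can be computed with a cheap
    # aggregate query instead of touching the joined data matrix.
    dim_centers = []
    for j in range(d):
        km1d = KMeans(n_clusters=k_per_dim, n_init=5, random_state=seed)
        km1d.fit(X[:, j].reshape(-1, 1))
        dim_centers.append(np.sort(km1d.cluster_centers_.ravel()))

    # Step 2: the grid coreset is the Cartesian product of the per-dimension
    # centers: at most k_per_dim ** d candidate points, independent of n.
    grid = np.array(list(product(*dim_centers)))

    # Step 3: weight each grid point by the number of input points that snap
    # to it, i.e. whose nearest per-dimension center matches in every column.
    idx = np.stack(
        [np.argmin(np.abs(X[:, j][:, None] - dim_centers[j][None, :]), axis=1)
         for j in range(d)],
        axis=1,
    )
    flat = np.ravel_multi_index(idx.T, (k_per_dim,) * d)
    weights = np.bincount(flat, minlength=k_per_dim ** d).astype(float)
    keep = weights > 0  # drop empty grid cells

    # Step 4: run weighted k-means on the (much smaller) coreset.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    km.fit(grid[keep], sample_weight=weights[keep])
    return km.cluster_centers_


# Toy usage: 10,000 points in 3 dimensions reduce to at most 5**3 = 125
# weighted coreset points before the final k-means is run.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
centers = grid_coreset_kmeans(X, k=4)
print(centers.shape)  # (4, 3)

The point of the design is that steps 1 and 3 only need per-dimension statistics, which a relational engine can compute with aggregate queries over the input relations, while steps 2 and 4 operate on a coreset whose size depends on the grid resolution rather than on the number of tuples in the join.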
