A Variational EM Acceleration for Efficient Clustering at Very Large Scales

How can we efficiently find a very large number C of clusters in very large datasets of N data points with potentially high dimensionality D? Here we address the question by using a novel variational approach to optimize Gaussian mixture models (GMMs) with diagonal covariance matrices. The variational method approximates expectation maximization (EM) by applying truncated posteriors as variational distributions and partial E-steps in combination with coresets. The run time complexity of optimizing the clustering objective then reduces from O(NCD) per conventional EM iteration to O(N′G²D) per variational EM iteration on coresets (with coreset size N′ ≤ N and truncation parameter G ≪ C). Based on the strongly reduced run time complexity per iteration, which scales sublinearly with NC, we then provide a concrete, practically applicable, parallelized and highly efficient clustering algorithm. In numerical experiments on standard large-scale benchmarks we (A) show that overall clustering times also scale sublinearly with NC, and (B) observe substantial wall-clock speedups compared to already highly efficient, recently reported results. The algorithm's sublinear scaling allows for applications at scales where alternative methods cease to be applicable. We demonstrate such very large-scale applicability on the YFCC100M benchmark, where we use a GMM with up to 50,000 clusters to optimize a data density model with up to 150 M parameters.
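
To make the truncation concrete, below is a minimal NumPy sketch of a sparse E-step for a diagonal-covariance GMM in the spirit of the abstract: each data point evaluates only G candidate clusters instead of all C, and responsibilities are normalized over that truncated support. The function name `truncated_e_step` and the random candidate sets `K_n` are illustrative assumptions, not the paper's implementation; in the actual method, candidate sets are adapted across iterations (e.g., via distances among candidate clusters).

```python
import numpy as np

def truncated_e_step(X, mu, sigma2, pi_c, K_n):
    """Sparse E-step: responsibilities over G candidate clusters per point.

    X:      (N', D)  coreset data points
    mu:     (C, D)   cluster means
    sigma2: (C, D)   per-dimension variances (diagonal covariances)
    pi_c:   (C,)     mixing proportions
    K_n:    (N', G)  candidate cluster indices per point (the truncation)
    Returns (N', G) responsibilities, normalized over the candidates only.
    """
    diff = X[:, None, :] - mu[K_n]                        # (N', G, D)
    # log p(x_n, c) for the G candidates of each point only
    log_joint = (np.log(pi_c[K_n])
                 - 0.5 * np.sum(np.log(2.0 * np.pi * sigma2[K_n]), axis=2)
                 - 0.5 * np.sum(diff ** 2 / sigma2[K_n], axis=2))
    # Normalizing over the truncated support is the variational approximation.
    log_joint -= log_joint.max(axis=1, keepdims=True)     # numerical stability
    resp = np.exp(log_joint)
    resp /= resp.sum(axis=1, keepdims=True)               # (N', G)
    return resp

# Toy usage with stand-in candidate sets (the real method adapts K_n):
rng = np.random.default_rng(0)
Np, C, D, G = 1000, 500, 16, 5
X = rng.normal(size=(Np, D))
mu = rng.normal(size=(C, D))
sigma2 = np.ones((C, D))
pi_c = np.full(C, 1.0 / C)
K_n = rng.integers(0, C, size=(Np, G))
resp = truncated_e_step(X, mu, sigma2, pi_c, K_n)
print(resp.shape)  # (1000, 5)
```

Evaluating responsibilities this way costs O(N′GD) per iteration rather than O(NCD); the additional factor of G in the paper's O(N′G²D) bound presumably stems from updating the candidate sets via distances among candidate clusters, a step not shown in this sketch.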
