Large-Scale High-Dimensional Clustering with Fast Sketching

In this paper, we address the problem of high-dimensional $k$-means clustering in a large-scale setting, i.e., for datasets comprising a large number of items. Sketching techniques have previously been used to deal with this "large-scale" issue by compressing the whole dataset into a single vector of random nonlinear generalized moments, from which the $k$ centroids are then retrieved efficiently. However, the cost of computing this sketch usually scales quadratically with the dimension of the data; to cope with high-dimensional datasets, we show how to use fast structured random matrices to compute the sketching operator efficiently. This yields significant speed-ups and memory savings for high-dimensional data, while the clustering results are shown to be much more stable, both on artificial and real datasets.
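
To make the two ingredients of the abstract concrete, here is a minimal, illustrative sketch, not the authors' implementation. It assumes the nonlinear generalized moments are random Fourier features (averages of complex exponentials), and that the dense Gaussian frequency matrix is replaced by a Fastfood-style product of diagonal sign matrices and Walsh-Hadamard transforms. All function names (dense_sketch, hd_block, fast_sketch) and the crude radial scaling are assumptions for illustration only.

```python
# Illustrative sketch (assumed setup, not the paper's code): random-Fourier-
# feature moments, with a structured HD-block frequency matrix replacing a
# dense Gaussian draw.
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)

def dense_sketch(X, Omega):
    """Empirical sketch: average of exp(i * Omega^T x) over the dataset.
    X: (n, d) data, Omega: (d, m) Gaussian frequencies. Cost O(n d m)."""
    return np.exp(1j * X @ Omega).mean(axis=0)

def hd_block(Z, signs):
    """One structured block: random +/-1 diagonal, then a normalized
    Walsh-Hadamard transform. Here the Hadamard matrix is built explicitly
    for simplicity; an in-place fast transform would cost O(d log d)."""
    d = Z.shape[-1]
    H = hadamard(d) / np.sqrt(d)  # requires d to be a power of two
    return (Z * signs) @ H

def fast_sketch(X, signs_list, scales):
    """Structured sketch: the frequency matrix is (implicitly) a product of
    HD blocks and a random radial scaling, so Omega^T x is applied without
    ever storing a dense d x m matrix."""
    Z = X
    for signs in signs_list:
        Z = hd_block(Z, signs)
    return np.exp(1j * Z * scales).mean(axis=0)

# Toy usage: d must be a power of two for the Hadamard transform.
n, d = 1000, 64
X = rng.standard_normal((n, d))

Omega = rng.standard_normal((d, d))            # dense baseline
z_dense = dense_sketch(X, Omega)

signs_list = [rng.choice([-1.0, 1.0], size=d) for _ in range(3)]
scales = np.abs(rng.standard_normal(d))        # crude radial distribution
z_fast = fast_sketch(X, signs_list, scales)    # (d,) complex sketch vector
```

One structured block yields only d frequencies; to obtain a sketch of size m > d, one would stack several independent blocks, and a real implementation would use an in-place fast Walsh-Hadamard transform rather than the explicit matrix above. The centroids are then recovered from the sketch by a separate decoding step (e.g., a greedy compressive-learning solver), which this snippet does not cover.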
