A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data

Learning a statistical model for high-dimensional data is an important topic in machine learning. Although this problem has been well studied in the supervised setting, little is known about its unsupervised counterpart. In this work, we focus on clustering high-dimensional data with sparse centers. In particular, we address the following open question in unsupervised learning: "Is it possible to reliably cluster high-dimensional data when the number of samples is smaller than the data dimensionality?" We develop an efficient clustering algorithm that estimates sparse cluster centers with a single pass over the data. Our theoretical analysis shows that the proposed algorithm accurately recovers the cluster centers from only O(s log d) samples (data points), provided all cluster centers are s-sparse vectors in a d-dimensional space. Experimental results on several benchmark datasets verify both the effectiveness and the efficiency of the proposed algorithm compared with state-of-the-art clustering methods.
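The abstract only summarizes the approach, so the following is a hypothetical sketch of the general idea rather than the paper's actual algorithm: a single pass of online nearest-center assignment with running-mean updates, followed by hard-thresholding each center to its s largest-magnitude coordinates to enforce s-sparsity. The function name and the initialization heuristic are assumptions introduced for illustration.

```python
import numpy as np

def single_pass_sparse_kmeans(X, k, s):
    """Illustrative sketch: one pass over X with online running-mean
    center updates, then hard-thresholding to keep only the s
    largest-magnitude coordinates of each center."""
    n, d = X.shape
    # Initialize centers from the first k points (a common heuristic;
    # the paper's initialization may differ).
    centers = X[:k].astype(float).copy()
    counts = np.ones(k)
    for x in X[k:]:  # single pass over the remaining data
        j = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]  # running mean update
    # Sparsify: zero out all but the s largest-magnitude entries per center.
    for j in range(k):
        if s < d:
            small = np.argsort(np.abs(centers[j]))[:-s]
            centers[j][small] = 0.0
    return centers
```

The hard-thresholding step is what ties storage and sample requirements to the sparsity level s rather than the full dimensionality d, which is the intuition behind the O(s log d) sample bound claimed above.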
