Large-scale subspace clustering using sketching and validation

The massive amounts of data generated and communicated today pose major processing challenges. Although subspace clustering (SC) methods can successfully classify nonlinearly separable objects in various settings, they incur prohibitively high computational complexity on large-scale data. Inspired by the random sample consensus (RANSAC) approach to robust regression, the present paper introduces a randomized scheme for SC, termed sketching and validation (SkeVa-)SC, tailored for large-scale data. At the heart of SkeVa-SC lies a randomized scheme that approximates the underlying probability density function of the observed data via kernel smoothing arguments. Sparsity in data representations is also exploited to reduce the computational burden of SC while achieving high clustering accuracy. Performance analysis and extensive numerical tests on synthetic and real data corroborate the potential of SkeVa-SC and its competitive performance relative to state-of-the-art scalable SC approaches.

Keywords: Subspace clustering, big data, kernel smoothing, randomization, sketching, validation, sparsity.
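The sketch-and-validate idea summarized in the abstract can be illustrated with a minimal sketch in Python. It is not the paper's algorithm: the function names, the Gaussian kernel, the fixed bandwidth, and in particular the validation score (average log-density of a random validation batch) are all assumptions made here for illustration. The abstract does not state which criterion SkeVa-SC uses, and the clustering step that would run on the selected sketch is omitted.

```python
import math
import random


def gaussian_kde(sketch, bandwidth):
    """Return a function evaluating a Gaussian kernel density estimate
    built from the points in `sketch` (hypothetical helper)."""
    dim = len(sketch[0])
    norm = len(sketch) * (2.0 * math.pi * bandwidth ** 2) ** (dim / 2.0)

    def density(x):
        total = 0.0
        for s in sketch:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, s))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return total / norm

    return density


def sketch_and_validate(data, sketch_size, val_size, n_trials, bandwidth, seed=0):
    """Draw `n_trials` random sketches and keep the one whose kernel
    density estimate best explains a random validation batch.

    Assumption: "best" means highest average log-density on the batch.
    """
    rng = random.Random(seed)
    best_score, best_idx = -math.inf, None
    for _ in range(n_trials):
        # Sketching: sample a small subset of the data without replacement.
        idx = rng.sample(range(len(data)), sketch_size)
        density = gaussian_kde([data[i] for i in idx], bandwidth)
        # Validation: score the sketch on a fresh random batch.
        val = rng.sample(range(len(data)), val_size)
        # The small constant guards against log(0) if the density underflows.
        score = sum(math.log(density(data[j]) + 1e-300) for j in val) / val_size
        if score > best_score:
            best_score, best_idx = score, idx
    return best_idx, best_score
```

Under these assumptions, a call such as `sketch_and_validate(data, 50, 100, 20, 0.5)` returns the indices of the selected sketch, on which an SC method could then be run at a cost that depends on the sketch size rather than on the full data size.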
