On the clustering of large-scale data: A matrix-based approach

The analysis of large collections of digital documents has become an active research topic as libraries and databases, such as PubMed and the IEEE publication archives, are converted to electronic form. The ubiquity of massive yet sparse data poses considerable challenges for data mining research. In this paper, we propose a theoretical framework, Exemplar-based Low-rank sparse Matrix Decomposition (ELMD), for clustering large-scale datasets. Specifically, given a data matrix, ELMD first computes a representative data subspace and a near-optimal low-rank approximation. The cluster centroids and indicators are then obtained through matrix decomposition, under the constraint that the cluster centroids lie within the representative subspace. From a theoretical perspective, we establish the correctness and convergence of the ELMD algorithm and analyze its efficiency in detail. Through extensive experiments on both synthetic and real datasets, we demonstrate the superior performance of ELMD for clustering large-scale data.
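The pipeline described above can be sketched in a simplified form. The following is a hypothetical illustration only, not the authors' ELMD algorithm: it samples exemplar columns to form a representative subspace, builds a low-rank approximation by least-squares projection onto that subspace, and then runs plain Lloyd-style k-means on the projected data, so the resulting centroids lie within the exemplar span by construction. The function name `elmd_sketch` and all parameter choices are assumptions for the example.

```python
import numpy as np

def elmd_sketch(X, k, n_exemplars, n_iter=100, seed=0):
    """Hypothetical sketch of exemplar-constrained clustering (not the paper's method).

    X : (m, n) data matrix, one column per data point.
    k : number of clusters.
    n_exemplars : number of columns sampled to span the representative subspace.
    """
    rng = np.random.default_rng(seed)

    # Step 1: sample exemplar columns as a representative data subspace.
    idx = rng.choice(X.shape[1], size=n_exemplars, replace=False)
    C = X[:, idx]  # exemplar basis, shape (m, n_exemplars)

    # Step 2: low-rank approximation by projecting X onto span(C).
    coeff, _, _, _ = np.linalg.lstsq(C, X, rcond=None)
    X_low = C @ coeff  # every column of X_low lies in span(C)

    # Step 3: Lloyd-style k-means on the projected data; centroids are
    # means of points in span(C), hence stay in the representative subspace.
    centers = X_low[:, rng.choice(X.shape[1], size=k, replace=False)]
    for _ in range(n_iter):
        # squared distances: (n, k)
        d = ((X_low[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)
        labels = d.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[:, j] = X_low[:, mask].mean(axis=1)
    return labels, centers
```

In this toy version, the subspace constraint is enforced implicitly by projecting the data before clustering; the paper instead builds the constraint into the matrix decomposition itself.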
