Clustering Large Graphs via the Singular Value Decomposition

We consider the problem of partitioning a set of m points in n-dimensional Euclidean space into k clusters (usually m and n are variable, while k is fixed), so as to minimize the sum of squared distances between each point and its cluster center. This formulation is usually the objective of the k-means clustering algorithm (Kanungo et al. (2000)). We prove that this problem is NP-hard even for k = 2, and we consider a continuous relaxation of this discrete problem: find the k-dimensional subspace V that minimizes the sum of squared distances to V of the m points. This relaxation can be solved by computing the Singular Value Decomposition (SVD) of the m × n matrix A that represents the m points; this solution can be used to obtain a 2-approximation algorithm for the original problem. We then argue that the relaxation in fact provides a generalized clustering which is useful in its own right. Finally, we show that the SVD of a random submatrix of a given matrix, chosen according to a suitable probability distribution, provides an approximation to the SVD of the whole matrix, thus yielding a very fast randomized algorithm. We expect this algorithm to be the main contribution of this paper, since it can be applied to problems of very large size which typically arise in modern applications.
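The quantities named in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's algorithm: the function names are hypothetical, and the sampling distribution (rows drawn with probability proportional to their squared norm, then rescaled) is an assumption, since the abstract says only "a suitable probability distribution". The sketch computes the discrete k-means objective, the relaxed subspace objective obtained from the SVD, and an approximate top-k subspace from a random sample of rows.

```python
import numpy as np


def kmeans_cost(A, labels, k):
    """Discrete objective: sum of squared distances from each row of A
    to the centroid of the cluster it is assigned to."""
    cost = 0.0
    for j in range(k):
        members = A[labels == j]
        if len(members):
            cost += np.sum((members - members.mean(axis=0)) ** 2)
    return float(cost)


def subspace_cost(A, k):
    """Relaxed objective: sum of squared distances from the rows of A to the
    best k-dimensional subspace, which equals the sum of the squared
    singular values beyond the k-th."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(s[k:] ** 2))


def sampled_right_singular_vectors(A, k, c, rng):
    """Approximate the top-k right singular vectors of A from c sampled rows.

    Assumption (not stated in the abstract): rows are drawn with probability
    proportional to their squared norm and rescaled by 1 / sqrt(c * p_i).
    """
    p = np.sum(A ** 2, axis=1)
    p = p / p.sum()
    idx = rng.choice(A.shape[0], size=c, replace=True, p=p)
    S = A[idx] / np.sqrt(c * p[idx])[:, None]
    # The small c x n matrix S is decomposed instead of the full m x n matrix A.
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    return Vt[:k]
```

Because the k cluster centers span a subspace of dimension at most k, `subspace_cost(A, k)` is a lower bound on `kmeans_cost(A, labels, k)` for every assignment `labels`, which is what makes the subspace problem a relaxation. The rows returned by `sampled_right_singular_vectors` are orthonormal, so projecting with `A @ V.T @ V` gives a rank-k approximation whose error can be compared against the exact optimum from `subspace_cost`.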

[1]  T S Huang,et al.  Image restoration by singular value decomposition. , 1975, Applied optics.

[2]  H. Andrews,et al.  Singular value decompositions and digital image processing , 1976 .

[3]  B. Parlett The Symmetric Eigenvalue Problem , 1981 .

[4]  János Komlós,et al.  The eigenvalues of random symmetric matrices , 1981, Comb..

[5]  M. Jambu,et al.  Cluster analysis and data analysis , 1985 .

[6]  Gene H. Golub,et al.  Matrix computations , 1983 .

[7]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[8]  Colin McDiarmid,et al.  Surveys in Combinatorics, 1989: On the method of bounded differences , 1989 .

[9]  Allen Gersho,et al.  Vector quantization and signal compression , 1991, The Kluwer international series in engineering and computer science.

[10]  J. A. Robinson Singular value decomposition for approximate block matching in image coding , 1995 .

[11]  Jon M. Kleinberg,et al.  Two algorithms for nearest-neighbor search in high dimensions , 1997, STOC '97.

[12]  Dana Ron,et al.  Property testing and its connection to learning and approximation , 1998, JACM.

[13]  Santosh S. Vempala,et al.  Latent semantic indexing: a probabilistic analysis , 1998, PODS '98.

[14]  Alan M. Frieze,et al.  Fast Monte-Carlo algorithms for finding low-rank approximations , 1998, Proceedings 39th Annual Symposium on Foundations of Computer Science (Cat. No.98CB36280).

[15]  Jon M. Kleinberg Authoritative sources in a hyperlinked environment , 1999 .

[16]  Vijay V. Vazirani,et al.  Primal-dual approximation algorithms for metric facility location and k-median problems , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[17]  Sudipto Guha,et al.  Improved combinatorial algorithms for the facility location and k-median problems , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[18]  Santosh S. Vempala,et al.  Latent Semantic Indexing , 2000, PODS 2000.

[19]  Jon M. Kleinberg,et al.  Clustering categorical data: an approach based on dynamical systems , 2000, The VLDB Journal.

[20]  David M. Mount,et al.  The analysis of a simple k-means clustering algorithm , 2000, SCG '00.

[21]  Rafail Ostrovsky,et al.  Polynomial time approximation schemes for geometric k-clustering , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[22]  A. K. Pujari,et al.  Data Mining Techniques , 2006 .

[23]  Anna R. Karlin,et al.  Spectral analysis of data , 2001, STOC '01.

[24]  Petros Drineas,et al.  Fast Monte-Carlo algorithms for approximate matrix multiplication , 2001, Proceedings 2001 IEEE International Conference on Cluster Computing.

[25]  Dimitris Achlioptas,et al.  Fast computation of low rank matrix approximations , 2001, STOC '01.

[26]  Michael Molloy,et al.  A sharp threshold in proof complexity , 2001, STOC '01.

[27]  C. Papadimitriou,et al.  The complexity of massive data set computations , 2002 .

[28]  Rafail Ostrovsky,et al.  Polynomial-time approximation schemes for geometric min-sum median clustering , 2002, JACM.

[29]  Sudipto Guha,et al.  A constant-factor approximation algorithm for the k-median problem (extended abstract) , 1999, STOC '99.

[30]  Ziv Bar-Yossef,et al.  Sampling lower bounds via information theory , 2003, STOC '03.

[31]  Limsoon Wong,et al.  DATA MINING TECHNIQUES , 2003 .