Subspace Sampling and Relative-Error Matrix Approximation: Column-Based Methods

Given an $m \times n$ matrix $A$ and an integer $k$ less than the rank of $A$, the "best" rank-$k$ approximation to $A$ with respect to the Frobenius norm is $A_k$, obtained by projecting $A$ onto its top $k$ left singular vectors. While $A_k$ is routinely used in data analysis, it is difficult to interpret in terms of the original data, namely the columns and rows of $A$. For example, these columns and rows often come from some application domain, whereas the singular vectors are linear combinations of (up to all) the columns or rows of $A$. We address the problem of obtaining low-rank approximations that are directly interpretable in terms of the original columns or rows of $A$. Our main results are two polynomial-time randomized algorithms that take as input a matrix $A$ and return as output a matrix $C$, consisting of a "small" (i.e., a low-degree polynomial in $k$, $1/\epsilon$, and $\log(1/\delta)$) number of actual columns of $A$, such that $\|A - CC^{+}A\|_F \le (1+\epsilon)\,\|A - A_k\|_F$ with probability at least $1-\delta$. Our algorithms are simple, and their running time is of the order of the time needed to compute the top $k$ right singular vectors of $A$. In addition, they sample the columns of $A$ via the method of "subspace sampling," so named because the sampling probabilities depend on the lengths of the rows of the top singular vectors, and because these probabilities ensure that we capture a certain subspace of interest in its entirety.
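As a concrete illustration of the subspace-sampling idea, the following minimal Python/NumPy sketch samples columns with probabilities proportional to the squared row lengths of the top-$k$ right singular vectors (the rank-$k$ leverage scores) and compares the projection error $\|A - CC^{+}A\|_F$ against the best rank-$k$ error $\|A - A_k\|_F$. The function name `subspace_sample_columns` and the choice of drawing $c$ columns with replacement are illustrative assumptions; the paper's actual algorithms specify the number of sampled columns and other details more carefully.

```python
import numpy as np

def subspace_sample_columns(A, k, c, rng):
    """Pick c columns of A with probabilities proportional to the
    squared lengths of the rows of V_k (the top-k right singular
    vectors), i.e., the rank-k leverage scores of the columns.
    Simplified sketch; not the paper's exact algorithm."""
    # Rows of Vt are right singular vectors; column j of Vt[:k] is
    # the j-th row of V_k, so sum its squared entries down axis 0.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    lev = np.sum(Vt[:k] ** 2, axis=0)   # leverage scores; they sum to k
    p = lev / lev.sum()                 # normalize to a distribution
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    return A[:, idx], idx

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 60)) @ rng.standard_normal((60, 80))
k, c = 5, 40

C, _ = subspace_sample_columns(A, k, c, rng)
residual = A - C @ np.linalg.pinv(C) @ A      # A - C C^+ A

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = (U[:, :k] * s[:k]) @ Vt[:k]              # best rank-k approximation

print("||A - CC^+A||_F =", np.linalg.norm(residual, "fro"))
print("||A - A_k||_F   =", np.linalg.norm(A - Ak, "fro"))
```

With $c$ large enough (a low-degree polynomial in $k$, $1/\epsilon$, and $\log(1/\delta)$, per the statement above), the first printed quantity should be within a $(1+\epsilon)$ factor of the second with probability at least $1-\delta$.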
