An improved approximation algorithm for the column subset selection problem

We consider the problem of selecting the "best" subset of exactly k columns from an m × n matrix A. In particular, we present and analyze a novel two-stage algorithm that runs in O(min{mn², m²n}) time and returns as output an m × k matrix C consisting of exactly k columns of A. In the first stage (the randomized stage), the algorithm randomly selects O(k log k) columns according to a judiciously-chosen probability distribution that depends on information in the top-k right singular subspace of A. In the second stage (the deterministic stage), the algorithm applies a deterministic column-selection procedure to select and return exactly k columns from the set of columns selected in the first stage. Let C be the m × k matrix containing those k columns, let P_C denote the projection matrix onto the span of those columns, and let A_k denote the "best" rank-k approximation to the matrix A as computed with the singular value decomposition. Then, we prove that

||A − P_C A||_2 ≤ O(k^{3/4} log^{1/2}(k) (n − k)^{1/4}) ||A − A_k||_2

with probability at least 0.7. This spectral norm bound improves upon the best previously-existing result (of Gu and Eisenstat [21]) for the spectral norm version of this Column Subset Selection Problem. We also prove that

||A − P_C A||_F ≤ O(k √(log k)) ||A − A_k||_F

with the same probability. This Frobenius norm bound is only a factor of √(k log k) worse than the best previously-existing existential result and is roughly O(√(k!)) better than the best previous algorithmic result (both of Deshpande et al. [11]) for the Frobenius norm version of this Column Subset Selection Problem.
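To make the two-stage procedure concrete, the following is a minimal sketch in Python (NumPy/SciPy), not the paper's algorithm verbatim: the function name two_stage_css, the particular sample size c, sampling without replacement, and the use of ordinary column-pivoted QR in the second stage (a stand-in for the strong rank-revealing QR factorization of Gu and Eisenstat [8] used in the actual deterministic stage) are all simplifying assumptions.

```python
# A minimal sketch of the two-stage column subset selection algorithm
# described above, assuming NumPy/SciPy. Simplifications: exact SVD,
# sampling without replacement, and ordinary column-pivoted QR in
# place of a strong rank-revealing QR.
import numpy as np
from scipy.linalg import qr

def two_stage_css(A, k, c=None, seed=0):
    """Return the indices of exactly k columns of the m x n matrix A."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    if c is None:
        c = min(n, int(np.ceil(k * np.log(k + 1))) + k)  # O(k log k) samples

    # Stage 1 (randomized): sample c columns with probabilities
    # proportional to the squared column norms of V_k^T, i.e. the
    # "leverage scores" computed from the top-k right singular
    # subspace of A.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    lev = np.sum(Vt[:k, :] ** 2, axis=0)  # leverage scores; they sum to k
    sampled = rng.choice(n, size=c, replace=False, p=lev / lev.sum())

    # Stage 2 (deterministic): pivoted QR on the sampled columns of
    # V_k^T; keep the first k pivots and map back to column indices of A.
    _, _, piv = qr(Vt[:k, sampled], pivoting=True)
    return sampled[piv[:k]]

# Usage: compare the residual of the selected columns against ||A - A_k||_F.
A = np.random.default_rng(1).standard_normal((100, 60))
k = 5
C = A[:, two_stage_css(A, k)]
residual = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A)  # ||A - P_C A||_F
_, s, _ = np.linalg.svd(A, full_matrices=False)
print(f"||A - P_C A||_F = {residual:.3f}, ||A - A_k||_F = {np.linalg.norm(s[k:]):.3f}")
```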

[1] L. Foster, Rank and null space calculations using matrix decomposition without column interchanges, 1986.

[2] W. Krzanowski, Selection of Variables to Preserve Multivariate Data Structure, Using Principal Components, 1987.

[3] S. Chatterjee, Sensitivity analysis in linear regression, 1988.

[4] Per Christian Hansen, et al., Some Applications of the Rank Revealing QR Factorization, 1992, SIAM J. Sci. Comput.

[5] C. Pan, et al., Rank-Revealing QR Factorizations and the Singular Value Decomposition, 1992.

[6] P. Tang, et al., Bounds on Singular Values Revealed by QR Factorizations, 1999.

[7] Per Christian Hansen, et al., Low-rank revealing QR factorizations, 1994, Numerical Linear Algebra with Applications.

[8] Ming Gu, et al., Efficient Algorithms for Computing a Strong Rank-Revealing QR Factorization, 1996, SIAM J. Sci. Comput.

[9] Rajeev Motwani, et al., Randomized algorithms, 1996, CSUR.

[10] Christian H. Bischof, et al., Computing rank-revealing QR factorizations of dense matrices, 1998, TOMS.

[11] Christian H. Bischof, et al., Algorithm 782: codes for rank-revealing QR factorizations of dense matrices, 1998, TOMS.

[12] G. W. Stewart, Four algorithms for the efficient computation of truncated pivoted QR approximations to a sparse matrix, 1999, Numerische Mathematik.

[13] C. Pan, On the existence and computation of rank-revealing LU factorizations, 2000.

[14] T. Chan, Rank Revealing QR Factorizations, 2001.

[15] Prabhakar Raghavan, et al., Competitive recommendation systems, 2002, STOC '02.

[16] Gérard Dreyfus, et al., Ranking a Random Feature for Variable and Feature Selection, 2003, J. Mach. Learn. Res.

[17] I. Guyon, et al., Detecting stable clusters using principal component analysis, 2003, Methods in Molecular Biology.

[18] Alan M. Frieze, et al., Fast Monte-Carlo algorithms for finding low-rank approximations, 2004, JACM.

[19] Per Christian Hansen, et al., UTV Tools: Matlab templates for rank-revealing UTV decompositions, 1999, Numerical Algorithms.

[20] Amnon Shashua, et al., Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach, 2005.

[21] Kezhi Mao, Identifying critical variables of principal components for unsupervised feature selection, 2005, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[22] Santosh S. Vempala, et al., Matrix approximation and projective clustering via volume sampling, 2006, SODA '06.

[23] V. Rokhlin, et al., A randomized algorithm for the approximation of matrices, 2006.

[24] L. Foster, et al., Comparison of rank revealing algorithms applied to matrices with well defined numerical ranks, 2006.

[26] S. Muthukrishnan, et al., Subspace Sampling and Relative-Error Matrix Approximation: Column-Based Methods, 2006, APPROX-RANDOM.

[27] Petros Drineas, et al., Tensor-CUR decompositions for tensor-based data, 2006, KDD '06.

[28] Petros Drineas, et al., Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication, 2006, SIAM J. Comput.

[29] Santosh S. Vempala, et al., Adaptive Sampling and Fast Low-Rank Matrix Approximation, 2006, APPROX-RANDOM.

[30] Mark Rudelson, et al., Sampling from large matrices: An approach through geometric functional analysis, 2005, JACM.

[31] Jimeng Sun, et al., Less is More: Compact Matrix Decomposition for Large Sparse Graphs, 2007, SDM.

[32] Huan Liu, et al., Spectral feature selection for supervised and unsupervised learning, 2007, ICML '07.

[33] Gene H. Golub, et al., Numerical methods for solving linear least squares problems, 1965, Milestones in Matrix Computation.

[34] Michael W. Mahoney, et al., PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations, 2007, PLoS Genetics.

[35] M. Magdon-Ismail, et al., Finding Maximum Volume Sub-matrices of a Matrix, 2007.

[36] V. Rokhlin, et al., A fast randomized algorithm for the approximation of matrices, 2007.

[37] Christos Boutsidis, et al., Unsupervised feature selection for principal components analysis, 2008, KDD.

[38] S. Muthukrishnan, et al., Relative-Error CUR Matrix Decompositions, 2007, SIAM J. Matrix Anal. Appl.

[39] Robert H. Halstead, Matrix Computations, 2011, Encyclopedia of Parallel Computing.

[40] S. Muthukrishnan, et al., Faster least squares approximation, 2007, Numerische Mathematik.