An empirical comparison of sampling techniques for matrix column subset selection

Column subset selection (CSS) is the problem of selecting a small subset of columns from a large data matrix as a form of interpretable data summarization. Leverage score sampling, which enjoys both sound theoretical guarantees and strong empirical performance, is widely recognized as the state-of-the-art algorithm for column subset selection. In this paper, we revisit iterative norm sampling, a sampling-based CSS algorithm proposed even before leverage score sampling, and demonstrate its competitive performance under a wide range of experimental settings. We also compare iterative norm sampling with several other competing algorithms and show its superior performance in terms of both approximation accuracy and computational efficiency. We conclude that iterative norm sampling deserves further theoretical investigation and practical consideration for column subset selection.
