More Constraints, Smaller Coresets: Constrained Matrix Approximation of Sparse Big Data

We suggest a generic data reduction technique with provable guarantees for computing the low rank approximation of a matrix under some $\ell_z$ error, and for constrained factorizations such as the Non-negative Matrix Factorization (NMF). Our main algorithm reduces a given $n \times d$ matrix to a small, ε-dependent, weighted subset C of its rows (known as a coreset), whose size is independent of both n and d. We then prove that the output of existing algorithms applied to the resulting coreset can be turned into (1+ε)-approximations for the original (large) input matrix. In particular, we provide the first linear time approximation scheme (LTAS) for the rank-one NMF. The coreset C can be computed in parallel, using only one pass over a possibly unbounded stream of row vectors. In this sense we improve the result in [4] (Best paper of STOC 2013). Moreover, since C is a subset of these rows, its construction time, its sparsity (number of non-zero entries), and the sparsity of the resulting low rank approximation depend on the maximum sparsity of an input row, and not on the actual dimension d. In this sense, we improve the result of Liberty [21] (Best paper of KDD 2013) and answer affirmatively, and in a more general setting, his open question of computing such a coreset. Source code is provided for reproducing the experiments and for integration with existing and future algorithms.
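To make the notion of a "weighted subset of rows" concrete, here is a minimal Python sketch. It is not the construction proposed in this paper: it swaps in plain squared-norm importance sampling, a standard randomized technique that comes with none of the size guarantees claimed above. The function names `sq_norm` and `sample_weighted_rows`, the dictionary representation of sparse rows, and the parameter `m` are all assumptions made for illustration.

```python
import random


def sq_norm(row):
    """Squared Euclidean norm of a sparse row stored as {column: value}."""
    return sum(v * v for v in row.values())


def sample_weighted_rows(rows, m, seed=0):
    """Return m (row_index, weight) pairs chosen by squared-norm sampling.

    Illustrative stand-in only, NOT the paper's coreset algorithm.
    Row i is drawn with probability p_i = ||row_i||^2 / total and given
    weight 1 / (m * p_i), so the weighted sum of outer products of the
    sampled rows is an unbiased estimate of A^T A.
    """
    rng = random.Random(seed)
    norms = [sq_norm(r) for r in rows]
    total = sum(norms)
    if total == 0:
        raise ValueError("all rows are zero")
    picks = rng.choices(range(len(rows)), weights=norms, k=m)
    return [(i, total / (m * norms[i])) for i in picks]


# Hypothetical sparse input: each row touches only a few columns.
rows = [{0: 3.0, 5: 4.0}, {2: 1.0}, {1: 2.0, 7: 2.0, 9: 1.0}, {5: 6.0}]
coreset = sample_weighted_rows(rows, m=3)
```

Because the output stores only indices of original rows plus one weight each, every selected row is exactly as sparse as it was in the input, and the per-row work depends on the number of non-zero entries rather than on d. This mirrors the sparsity argument made in the abstract, though only as an analogy.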

[1] Manfred Georg, et al. On using nearly-independent feature families for high precision and confidence, 2012, Machine Learning.

[2] David P. Woodruff, et al. Low rank approximation and regression in input sparsity time, 2012, STOC '13.

[3] Dan Feldman, et al. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering, 2013, SODA.

[4] Sanjeev Arora, et al. Computing a nonnegative matrix factorization -- provably, 2011, STOC '12.

[5] Sariel Har-Peled, et al. On coresets for k-means and k-median clustering, 2004, STOC '04.

[6] Christos Boutsidis, et al. Random Projections for the Nonnegative Least-Squares Problem, 2008, ArXiv.

[7] Xin Liu, et al. Document clustering based on non-negative matrix factorization, 2003, SIGIR.

[8] Kasturi R. Varadarajan, et al. Efficient Subspace Approximation Algorithms, 2007, Discrete & Computational Geometry.

[9] S. Muthukrishnan, et al. Sampling algorithms for l2 regression and applications, 2006, SODA '06.

[10] Michael Langberg, et al. A unified framework for approximating and clustering data, 2011, STOC.

[11] Hyunsoo Kim, et al. Sparse Non-negative Matrix Factorizations via Alternating Non-negativity-constrained Least Squares, 2006.

[12] David P. Woodruff, et al. Coresets and sketches for high dimensional subspace approximation problems, 2010, SODA '10.

[13] Stephen A. Vavasis, et al. On the Complexity of Nonnegative Matrix Factorization, 2007, SIAM J. Optim.

[14] Haesun Park, et al. Toward Faster Nonnegative Matrix Factorization: A New Algorithm and Comparisons, 2008, 2008 Eighth IEEE International Conference on Data Mining.

[15] H. Sebastian Seung, et al. Learning the parts of objects by non-negative matrix factorization, 1999, Nature.

[16] Xin Xiao, et al. On the Sensitivity of Shape Fitting Problems, 2012, FSTTCS.

[17] Pauli Miettinen, et al. Interpretable nonnegative matrix decompositions, 2008, KDD.

[18] Sven Bergmann, et al. Iterative signature algorithm for the analysis of large-scale gene expression data, 2002, Physical review. E, Statistical, nonlinear, and soft matter physics.

[19] Dan Feldman, et al. A PTAS for k-means clustering based on weak coresets, 2007, SCG '07.

[20] Dan Feldman, et al. Dimensionality Reduction of Massive Sparse Datasets Using Coresets, 2015, NIPS.

[21] Edo Liberty, et al. Simple and deterministic matrix sketching, 2012, KDD.

[22] David P. Woodruff, et al. Frequent Directions: Simple and Deterministic Matrix Sketching, 2015, SIAM J. Comput.

[23] Thomas Hofmann, et al. Probabilistic Latent Semantic Analysis, 1999, UAI.

[24] H. Sebastian Seung, et al. Algorithms for Non-negative Matrix Factorization, 2000, NIPS.