Sketching Transformed Matrices with Applications to Natural Language Processing

Suppose we are given a large matrix $A=(a_{i,j})$ that cannot be stored in memory but instead resides on disk or is presented in a data stream. However, we need to compute a matrix decomposition of the entrywise-transformed matrix $f(A):=(f(a_{i,j}))$ for some function $f$. Can this be done in a space-efficient way? Many machine learning applications indeed need to deal with such large transformed matrices; for example, word embedding methods in NLP work with the pointwise mutual information (PMI) matrix, and the entrywise transformation makes it difficult to apply known linear-algebraic tools. Existing approaches for this problem either store the whole matrix and perform the entrywise transformation afterwards, which is space-consuming or infeasible, or redesign the learning method, which is application-specific and requires substantial remodeling. In this paper, we first propose a space-efficient sketching algorithm for computing the product of a given small matrix with the transformed matrix. It works for a general family of transformations with provably small error bounds and can thus be used as a primitive in downstream learning tasks. We then apply this primitive to a concrete application: low-rank approximation. We show that our approach achieves small error and is efficient in both space and time. We complement our theoretical results with experiments on synthetic and real data.
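
As a minimal illustration of the problem setup (not the paper's algorithm), the Python sketch below maintains the product $S \cdot f(A)$ in a single pass over a stream of entries $(i, j, a_{i,j})$, assuming each entry arrives exactly once; the name sketch_transformed and the choice $f(x)=\log(1+x)$ are ours for illustration. In this insertion-once model the maintained product is exact. The harder setting, which the paper's algorithm targets with provable error bounds, is when an entry is aggregated from many incremental updates (as with the co-occurrence counts underlying PMI), since $f$ of a sum is not the sum of $f$'s and the simple scheme below no longer applies.

```python
import numpy as np

def sketch_transformed(entry_stream, n_rows, n_cols, sketch_dim, f, seed=0):
    """Maintain S @ f(A) in one pass over entries (i, j, a_ij) of A.

    S is a (sketch_dim x n_rows) random sign matrix; neither A nor f(A)
    is ever materialized. Assumes each entry arrives exactly once.
    """
    rng = np.random.default_rng(seed)
    # Dense random signs for clarity; a sparse embedding such as
    # CountSketch would avoid storing S explicitly.
    S = rng.choice([-1.0, 1.0], size=(sketch_dim, n_rows)) / np.sqrt(sketch_dim)
    sketch = np.zeros((sketch_dim, n_cols))
    for i, j, a_ij in entry_stream:
        # Apply the entrywise transformation on the fly and fold the
        # transformed entry into column j of the sketch.
        sketch[:, j] += S[:, i] * f(a_ij)
    return sketch

# Toy usage with f(x) = log(1 + x), one transformation of the kind the
# paper considers.
n, d, m = 1000, 50, 64
A = np.abs(np.random.default_rng(1).normal(size=(n, d)))
stream = ((i, j, A[i, j]) for i in range(n) for j in range(d))
SfA = sketch_transformed(stream, n, d, m, f=np.log1p)

# Sanity check against the direct (memory-heavy) computation.
S = np.random.default_rng(0).choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)
assert np.allclose(SfA, S @ np.log1p(A))
```

A downstream task such as low-rank approximation can then operate on the small matrix $S \cdot f(A)$ alone, e.g., via an SVD, since it has only sketch_dim rows rather than the $n$ rows of $A$.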
