Sublinear Time Approximation of Text Similarity Matrices

We study algorithms for approximating pairwise similarity matrices that arise in natural language processing. In general, computing a similarity matrix for n data points requires Ω(n^2) similarity computations. This quadratic scaling is a significant bottleneck, especially when similarities are computed via expensive functions, e.g., via transformer models. Approximation methods reduce this quadratic complexity, often by using a small subset of exactly computed similarities to approximate the remainder of the complete pairwise similarity matrix. Significant work focuses on the efficient approximation of positive semidefinite (PSD) similarity matrices, which arise, e.g., in kernel methods. However, much less is understood about indefinite (non-PSD) similarity matrices, which often arise in NLP. Motivated by the observation that many of these matrices are still somewhat close to PSD, we introduce a generalization of the popular Nyström method to the indefinite setting. Our algorithm can be applied to any similarity matrix and runs in time sublinear in the size of the matrix, producing a rank-s approximation with just O(ns) similarity computations. We show that our method, along with a simple variant of CUR decomposition, performs very well in approximating a variety of similarity matrices arising in NLP tasks. We demonstrate high accuracy of the approximated similarity matrices on document classification, sentence similarity, and cross-document coreference tasks.
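The paper's indefinite Nyström variant itself is not reproduced here, but the following minimal sketch (Python/NumPy, with illustrative names such as `nystrom_approx` and a generic similarity callback `sim`, all assumptions of this sketch rather than the authors' code) shows the general shape of such methods: sample s landmark points, compute only the O(ns) similarities that touch the landmarks, and assemble a rank-s factorization via the pseudoinverse of the s x s landmark block. Using a pseudoinverse rather than an inverse is one natural way to keep the formula well defined when the matrix is indefinite or rank deficient; it is not a statement of the paper's exact correction.

```python
import numpy as np

def nystrom_approx(sim, points, s, seed=0):
    """Rank-s Nystrom-style approximation of the n x n matrix
    K[i, j] = sim(points[i], points[j]), using O(n*s) calls to sim.

    Returns (C, W_pinv) with K approximated by C @ W_pinv @ C.T;
    the full n x n matrix is never formed.
    """
    n = len(points)
    rng = np.random.default_rng(seed)
    landmarks = rng.choice(n, size=s, replace=False)  # uniform landmark sample

    # n x s block of exactly computed similarities: O(n*s) evaluations.
    C = np.array([[sim(points[i], points[j]) for j in landmarks]
                  for i in range(n)])
    # s x s block of similarities among the landmarks themselves.
    W = C[landmarks, :]

    # Pseudoinverse keeps the formula well defined even when W is
    # indefinite or rank deficient (classic Nystrom assumes PSD).
    return C, np.linalg.pinv(W)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pts = rng.standard_normal((500, 16))
    # A similarity that is generally indefinite, i.e., not a PSD kernel.
    sim = lambda x, y: np.cos(np.linalg.norm(x - y))
    C, W_pinv = nystrom_approx(sim, list(pts), s=50)
    K_approx = C @ W_pinv @ C.T  # rank <= 50 approximation of K
    print(K_approx.shape)
```

A CUR-style variant would similarly sample a subset of rows and a subset of columns and join them through the pseudoinverse of their intersection; both constructions avoid ever materializing the full n x n similarity matrix.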
