Learning and Inference via Maximum Inner Product Search

Log-linear models, a large class of commonly used probabilistic models, are defined only up to a normalization constant. Typical learning algorithms for such models require solving a sequence of probabilistic inference queries. These queries are typically intractable and are a major bottleneck when learning models with large output spaces. In this paper, we provide a new approach for amortizing the cost of a sequence of related inference queries, such as those arising during learning. Our technique relies on a surprising connection with algorithms developed over the past two decades for similarity search in large databases. Our approach achieves improved running times with provable approximation guarantees. We show that it performs well both on synthetic data and on neural language models with large output spaces.
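To make the setting concrete, the following is a minimal sketch of a log-linear model over a small, explicitly enumerated output space. It is an illustration only, not the algorithm proposed in the paper. The feature vectors, the parameter vector `theta`, and all function names are hypothetical. The sketch shows two things: computing the normalization constant by brute force touches every output, and finding the most probable output amounts to a maximum inner product search between `theta` and the feature vectors.

```python
import math


def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_partition(theta, features):
    """log Z = log sum_x exp(theta . phi(x)), by brute force.

    Uses the log-sum-exp trick for numerical stability. The cost is
    linear in the number of outputs, which is the bottleneck when the
    output space is large.
    """
    scores = [inner(theta, phi) for phi in features]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def probabilities(theta, features):
    """Normalized probabilities p(x) = exp(theta . phi(x)) / Z."""
    log_z = log_partition(theta, features)
    return [math.exp(inner(theta, phi) - log_z) for phi in features]


def mips(theta, features):
    """Brute-force maximum inner product search.

    Returns the index of the feature vector with the largest inner
    product with theta, which is also the mode of the model.
    """
    return max(range(len(features)), key=lambda i: inner(theta, features[i]))


# Toy data (assumed for illustration): four outputs with 2-d features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
theta = [0.5, 2.0]

probs = probabilities(theta, features)
mode = mips(theta, features)
```

In this sketch both routines scan every output. Similarity-search data structures aim to answer the `mips` query in less than linear time, which is the kind of saving the abstract refers to.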
