Anchor & Transform: Learning Sparse Representations of Discrete Objects

Learning continuous representations of discrete objects such as text, users, and URLs lies at the heart of many applications, including language and user modeling. When using discrete objects as input to neural networks, we often ignore the underlying structures (e.g., natural groupings and similarities) and embed the objects independently into individual vectors. Because the embedding table then grows linearly with the number of objects, existing methods do not scale to large vocabulary sizes. In this paper, we design a Bayesian nonparametric prior for embeddings that encourages sparsity and leverages natural groupings among objects. We derive an approximate inference algorithm based on Small Variance Asymptotics, which yields a simple and natural algorithm for learning a small set of anchor embeddings and a sparse transformation matrix. We call our method Anchor & Transform (ANT) because the embedding of each discrete object is a sparse linear combination of the anchors, weighted according to the transformation matrix. ANT is scalable, flexible, end-to-end trainable, and allows the user to incorporate domain knowledge about object relationships. On text classification and language modeling benchmarks, ANT demonstrates stronger performance with fewer parameters than existing compression baselines.
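
To make the composition concrete, the NumPy sketch below shows how ANT-style embeddings are formed as E = T A, with a small anchor matrix A and a sparse transformation matrix T. The sizes, the soft-threshold used to mimic a sparse T, and the names (ant_embed, num_anchors, etc.) are illustrative assumptions only; the paper learns T and A end-to-end under the Bayesian nonparametric prior rather than constructing them as done here.

```python
import numpy as np

def ant_embed(transform, anchors):
    # Each row of `transform` holds the (mostly zero) anchor weights for one
    # object; the object's embedding is the corresponding sparse linear
    # combination of the anchor embeddings: E = T @ A.
    return transform @ anchors

rng = np.random.default_rng(0)
num_objects, num_anchors, dim = 10_000, 50, 300   # hypothetical sizes

anchors = rng.normal(size=(num_anchors, dim))      # A: (num_anchors, dim)

# Mimic a learned sparse T by soft-thresholding random weights; in the paper
# T is learned jointly with A under a sparsity-inducing prior, not sampled.
transform = np.maximum(rng.normal(size=(num_objects, num_anchors)) - 1.5, 0.0)

embeddings = ant_embed(transform, anchors)         # E: (num_objects, dim)

# Rough parameter comparison: an independent embedding table stores
# num_objects * dim values, while ANT stores the anchors plus the
# nonzero entries of T.
dense_params = num_objects * dim
ant_params = num_anchors * dim + np.count_nonzero(transform)
print(embeddings.shape, dense_params, ant_params)
```

Under these assumed sizes, the printout illustrates the intended trade-off: the anchor matrix plus the nonzero entries of T can be far smaller than a full embedding table, while every object still receives a dense d-dimensional embedding.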
