Learning Compact Neural Word Embeddings by Parameter Space Sharing

The word embedding vectors obtained from neural word embedding methods, such as vLBL models and SkipGram, have become an important fundamental resource for tackling a wide variety of tasks in the artificial intelligence field. This paper focuses on the fact that the model size of high-quality embedding vectors is relatively large, i.e., more than 1GB. We propose a learning framework that can provide a set of 'compact' embedding vectors for the purpose of enhancing 'usability' in actual applications. Our proposed method incorporates parameter sharing constraints into the optimization problem. These additional constraints force the embedding vectors to share parameter values, which significantly shrinks model size. We investigate the trade-off between quality and model size of embedding vectors for several linguistic benchmark datasets, and show that our method can significantly reduce the model size while maintaining the task performance of conventional methods.

[1]  Ming Zhou,et al.  A Recursive Recurrent Neural Network for Statistical Machine Translation , 2014, ACL.

[2]  Wanxiang Che,et al.  Revisiting Embedding Features for Simple Semi-supervised Learning , 2014, EMNLP.

[3]  Koray Kavukcuoglu,et al.  Learning word embeddings efficiently with noise-contrastive estimation , 2013, NIPS.

[4]  Masaaki Nagata,et al.  A Unified Learning Framework of Skip-Grams and Global Vectors , 2015, ACL.

[5]  M. Hestenes Multiplier and gradient methods , 1969 .

[6]  Pritish Narayanan,et al.  Deep Learning with Limited Numerical Precision , 2015, ICML.

[7]  Geoffrey E. Hinton,et al.  A Better Way to Pretrain Deep Boltzmann Machines , 2012, NIPS.

[8]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[9]  Christopher J. C. Burges,et al.  The Microsoft Research Sentence Completion Challenge , 2011 .

[10]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[11]  Stephen P. Boyd,et al.  Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers , 2011, Found. Trends Mach. Learn..

[12]  Felix Hill,et al.  SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation , 2014, CL.

[13]  Harvey J. Everett Generalized Lagrange Multiplier Method for Solving Problems of Optimum Allocation of Resources , 1963 .

[14]  Danqi Chen,et al.  A Fast and Accurate Dependency Parser using Neural Networks , 2014, EMNLP.

[15]  Yoshua Bengio,et al.  BinaryConnect: Training Deep Neural Networks with binary weights during propagations , 2015, NIPS.

[16]  Enrico H. Gerding,et al.  Twenty-Eighth AAAI Conference on Artificial Intelligence , 2014, AAAI 2014.

[17]  Droniou Alain,et al.  Gated Autoencoders with Tied Input Weights , 2013, ICML 2013.

[18]  Eneko Agirre,et al.  A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches , 2009, NAACL.

[19]  Andrew Y. Ng,et al.  Improving Word Representations via Global Context and Multiple Word Prototypes , 2012, ACL.

[20]  Yixin Chen,et al.  Compressing Neural Networks with the Hashing Trick , 2015, ICML.

[21]  Shay B. Cohen,et al.  Advances in Neural Information Processing Systems 25 , 2012, NIPS 2012.

[22]  Yoshua Bengio,et al.  Word Representations: A Simple and General Method for Semi-Supervised Learning , 2010, ACL.

[23]  B. Mercier,et al.  A dual algorithm for the solution of nonlinear variational problems via finite element approximation , 1976 .

[24]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[25]  Christopher D. Manning,et al.  Better Word Representations with Recursive Neural Networks for Morphology , 2013, CoNLL.

[26]  G. Saridis,et al.  Journal of Optimization Theory and Applications Approximate Solutions to the Time-invariant Hamilton-jacobi-bellman Equation 1 , 1998 .

[27]  Masaaki Nagata,et al.  Fused Feature Representation Discovery for High-Dimensional and Sparse Data , 2014, AAAI.

[28]  Evgeniy Gabrilovich,et al.  A word at a time: computing word relatedness using temporal semantic analysis , 2011, WWW.

[29]  Guido Sanguinetti,et al.  Advances in Neural Information Processing Systems 24 , 2011 .

[30]  Omer Levy,et al.  Improving Distributional Similarity with Lessons Learned from Word Embeddings , 2015, TACL.

[31]  Elia Bruni,et al.  Multimodal Distributional Semantics , 2014, J. Artif. Intell. Res..

[32]  Geoffrey Zweig,et al.  Linguistic Regularities in Continuous Space Word Representations , 2013, NAACL.

[33]  Olivier Sigaud,et al.  Gated Autoencoders with Tied Input Weights , 2013, ICML.

[34]  M. J. D. Powell,et al.  A method for nonlinear constraints in minimization problems , 1969 .

[35]  L. Goddard,et al.  Operations Research (OR) , 2007 .