Improving Word Embedding Factorization for Compression using Distilled Nonlinear Neural Decomposition

Word embeddings are vital components of Natural Language Processing (NLP) models and have been extensively explored. However, they consume a large amount of memory, which poses a challenge for edge deployment. Embedding matrices typically contain most of the parameters in language models and about a third in machine translation systems. In this paper, we propose Distilled Embedding, an (input/output) embedding compression method based on low-rank matrix decomposition and knowledge distillation. First, we initialize the weights of our decomposed matrices by learning to reconstruct the full pre-trained word embedding; we then fine-tune end-to-end, employing knowledge distillation on the factorized embedding. We conduct extensive experiments at various compression rates on machine translation and language modeling, using different datasets and sharing a single word-embedding matrix between the input embedding and the vocabulary projection. We show that the proposed technique is simple to replicate, with a single fixed parameter controlling the compression size, and that it achieves higher BLEU scores on translation and lower perplexity on language modeling than complex, difficult-to-tune state-of-the-art methods.
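The abstract outlines a two-stage recipe: factorize the embedding with a nonlinearity between the factors, initialize the factors by reconstructing the pre-trained embedding, then fine-tune with knowledge distillation. The sketch below is a minimal illustration of that recipe, not the authors' released code; it assumes PyTorch, and names such as FactorizedEmbedding, the rank r, the GELU nonlinearity, and the temperature are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of a distilled, nonlinear low-rank embedding (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactorizedEmbedding(nn.Module):
    """Replaces a V x d embedding with a V x r lookup and an r x d projection
    joined by a nonlinearity, reducing parameters from V*d to roughly r*(V + d)."""

    def __init__(self, vocab_size: int, emb_dim: int, rank: int):
        super().__init__()
        self.low_rank = nn.Embedding(vocab_size, rank)        # V x r factor
        self.project = nn.Linear(rank, emb_dim, bias=False)   # r x d factor

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Nonlinear decomposition: lookup -> nonlinearity -> projection.
        return self.project(F.gelu(self.low_rank(token_ids)))

    def full_matrix(self) -> torch.Tensor:
        # Reconstructed V x d matrix, used for the reconstruction objective
        # and reusable (transposed) as the shared vocabulary projection.
        return self.project(F.gelu(self.low_rank.weight))


def pretrain_by_reconstruction(factorized, pretrained_emb, steps=1000, lr=1e-3):
    """Stage 1: initialize the factors by regressing onto the pre-trained embedding."""
    opt = torch.optim.Adam(factorized.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(factorized.full_matrix(), pretrained_emb)
        loss.backward()
        opt.step()
    return factorized


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Stage 2 (per batch): soften teacher and student distributions and penalize
    their KL divergence, mixed with cross-entropy during end-to-end fine-tuning."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t * t)


if __name__ == "__main__":
    V, d, r = 32000, 512, 64                      # roughly 8x fewer embedding parameters
    pretrained = torch.randn(V, d)                # stands in for a trained embedding matrix
    emb = pretrain_by_reconstruction(FactorizedEmbedding(V, d, r), pretrained, steps=10)
    print(emb.full_matrix().shape)                # torch.Size([32000, 512])
```

Under these assumptions, the single rank parameter r plays the role of the one fixed parameter controlling compression that the abstract mentions, and the same reconstructed matrix can serve both the input embedding and the output vocabulary projection when weights are tied.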
