Adaptive Pruning of Neural Language Models for Mobile Devices

Neural language models (NLMs) exist in an accuracy-efficiency tradeoff space where better perplexity typically comes at the cost of greater computational complexity. In a software keyboard application on mobile devices, this translates into higher power consumption and shorter battery life. This paper represents, to our knowledge, the first attempt to explore accuracy-efficiency tradeoffs for NLMs. Building on quasi-recurrent neural networks (QRNNs), we apply pruning techniques to provide a "knob" for selecting different operating points. In addition, we propose a simple technique to recover some perplexity using a negligible amount of memory. Our empirical evaluation considers both perplexity and energy consumption on a Raspberry Pi, where we demonstrate which methods provide the best perplexity-power consumption operating points. At one operating point, one of the techniques is able to provide energy savings of 40% over the state of the art with only a 17% relative increase in perplexity.
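The abstract describes pruning as a "knob" that trades perplexity for computation. The sketch below illustrates one common way such a knob can be realized, magnitude-based row (filter) pruning of a single weight matrix swept over several sparsity levels; it is an illustrative assumption, not necessarily the paper's exact pruning criterion, and the stand-in tensor shapes are hypothetical rather than taken from the QRNN architecture.

```python
# Minimal sketch of magnitude-based pruning as an accuracy-efficiency "knob".
# Assumption: the pruning criterion (L1 row norm) and the 512x512 stand-in
# weight matrix are illustrative, not the paper's exact method or model sizes.
import torch

def prune_by_magnitude(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude rows (filters) of a weight matrix.

    `sparsity` in [0, 1) selects the operating point: higher sparsity means
    fewer multiply-accumulates at inference time, typically at some cost
    in perplexity.
    """
    n_rows = weight.size(0)
    n_prune = int(sparsity * n_rows)
    if n_prune == 0:
        return weight
    # Rank rows by their L1 norm and zero the weakest ones.
    row_norms = weight.abs().sum(dim=1)
    prune_idx = torch.argsort(row_norms)[:n_prune]
    pruned = weight.clone()
    pruned[prune_idx] = 0.0
    return pruned

# Usage: sweep sparsity levels to trace out candidate operating points,
# which would then be evaluated for perplexity and on-device energy use.
weight = torch.randn(512, 512)  # stand-in for one recurrent-layer weight matrix
for sparsity in (0.0, 0.25, 0.5, 0.75):
    pruned = prune_by_magnitude(weight, sparsity)
    kept = int((pruned.abs().sum(dim=1) > 0).sum())
    print(f"sparsity={sparsity:.2f}: {kept}/{weight.size(0)} rows retained")
```

Each sparsity level corresponds to one operating point; in the paper's setting, the resulting models would be compared by perplexity and by measured power draw on the Raspberry Pi.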
