Towards Memory Friendly Long-Short Term Memory Networks (LSTMs) on Mobile GPUs

Intelligent Personal Assistants (IPAs) with natural language processing (NLP) capabilities are increasingly popular on today's mobile devices. Recurrent neural networks (RNNs), and in particular one of their variants, Long Short-Term Memory networks (LSTMs), have become the core machine learning technique in NLP-based IPAs. With the continuously improving performance of mobile GPUs, local processing has become a promising solution to the large data transmission and privacy issues induced by the cloud-centric computation of IPAs. However, LSTMs exhibit a quite inefficient memory access pattern when executed on mobile GPUs, due to redundant data movements and limited off-chip bandwidth. In this study, we explore memory-friendly LSTMs on mobile GPUs by hierarchically reducing off-chip memory accesses. To address the redundant data movements, we propose inter-cell optimizations that intelligently parallelize the originally sequentially executed LSTM cells (the basic units in RNNs, corresponding to neurons in CNNs) to improve data locality across cells with negligible accuracy loss. To relieve the pressure on limited off-chip memory bandwidth, we propose intra-cell optimizations that dynamically skip the loads and computations of rows in the weight matrices that contribute trivially to the outputs. We also introduce a lightweight module into the GPU architecture to perform this runtime row skipping. Moreover, our techniques are equipped with thresholds that provide a unique tuning space for performance-accuracy trade-offs guided directly by user preferences. The experimental results show that our optimizations achieve substantial improvements in both performance and power with user-imperceptible accuracy loss, and that they exhibit strong scalability as the input data set grows. Our user study also shows that the designed system delivers an excellent user experience.
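The intra-cell row-skipping idea can be sketched in software. The snippet below is a minimal illustration, not the paper's hardware module: it assumes a simple magnitude-bound criterion (max absolute weight in a row times the L1 norm of the input bounds the row's dot product), and skips any row whose bound falls below a user-set threshold. The function name and skipping criterion are illustrative assumptions.

```python
import numpy as np

def gate_matvec_row_skip(W, x, threshold):
    """Approximate W @ x by skipping rows with trivial contribution.

    Hypothetical criterion: |row . x| <= max|row| * sum|x|, so a row
    whose bound is below `threshold` is neither loaded nor computed
    (its output stays zero). threshold = 0 reproduces the exact result.
    """
    out = np.zeros(W.shape[0])
    x_l1 = np.sum(np.abs(x))  # shared bound factor for every row
    for i, row in enumerate(W):
        # Upper bound on this row's contribution to the output.
        if np.max(np.abs(row)) * x_l1 < threshold:
            continue  # skip both the row load and the dot product
        out[i] = row @ x
    return out
```

In the paper's setting this decision is made at runtime by a lightweight hardware module, and the threshold exposes the performance-accuracy trade-off: raising it skips more rows (fewer off-chip weight loads) at the cost of a larger output error.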
