MnnFast: A Fast and Scalable System Architecture for Memory-Augmented Neural Networks

Memory-augmented neural networks are attracting increasing attention from researchers because they can make inferences using the history stored in memory. Among memory-augmented neural networks, memory networks in particular are known for their strong reasoning power and their ability to learn from a larger number of inputs than other networks. As the size of input datasets grows rapidly, the need for large-scale memory networks continues to rise. Such large-scale memory networks provide excellent reasoning power; however, current computer infrastructure cannot achieve scalable performance due to its limited system architecture. In this paper, we propose MnnFast, a novel system architecture for large-scale memory networks that achieves fast and scalable reasoning performance. We identify the performance problems of the current architecture by conducting an extensive performance bottleneck analysis. Our in-depth analysis indicates that the current architecture suffers from three major performance problems: high memory bandwidth consumption, heavy computation, and cache contention. To overcome these problems, we propose three novel optimizations. First, to reduce memory bandwidth consumption, we propose a new column-based algorithm with streaming that minimizes the size of data spills and hides most of the off-chip memory access overhead. Second, to decrease the high computational overhead, we propose a zero-skipping optimization that bypasses a large amount of output computation. Lastly, to eliminate cache contention, we propose a dedicated embedding cache that efficiently caches the embedding matrix. Our evaluations show that MnnFast is significantly effective across various types of hardware: CPU, GPU, and FPGA. MnnFast improves overall throughput by up to 5.38×, 4.34×, and 2.01× on CPU, GPU, and FPGA, respectively. Also, compared to CPU-based MnnFast, our FPGA-based MnnFast achieves 6.54× higher energy efficiency.
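The following is a minimal, illustrative sketch (not the authors' implementation) of the zero-skipping idea. In an end-to-end memory network, the output is a weighted sum of memory embeddings, where the weights come from a softmax over attention scores; since many softmax weights can be near zero, their contributions to the output can be bypassed. The function name, threshold value, and per-row skipping granularity below are assumptions for illustration only.

import numpy as np

def output_with_zero_skipping(p, C, threshold=1e-4):
    # p: (num_memories,) softmax attention weights.
    # C: (num_memories, embed_dim) output embedding matrix.
    # threshold: weights below this are treated as zero (assumed value).
    o = np.zeros(C.shape[1])
    for i, w in enumerate(p):
        if w < threshold:   # skip near-zero attention weights
            continue
        o += w * C[i]       # accumulate only the significant contributions
    return o

In this sketch the saving comes from skipping the multiply-accumulate work for negligible weights; the paper's optimization operates at the hardware level together with its column-based streaming layout, which this simple row-wise loop does not attempt to model.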
