Manna: An Accelerator for Memory-Augmented Neural Networks

Memory-augmented neural networks (MANNs), which augment a traditional deep neural network (DNN) with an external, differentiable memory, are emerging as a promising direction in machine learning. MANNs have been shown to achieve one-shot learning and complex cognitive capabilities that are well beyond those of classical DNNs. We analyze the computational characteristics of MANNs and observe that they pose a unique challenge due to soft reads and writes to the differentiable memory, each of which requires access to all memory locations. As a result, MANNs perform poorly on modern CPUs, GPUs, and other accelerators. To address this, we present Manna, a specialized hardware inference accelerator for MANNs. Manna is a memory-centric design that focuses on maximizing performance in an extremely low FLOPs/byte regime. The key architectural features from which Manna derives efficiency are: (i) investing most of the die area and power in highly banked on-chip memories that provide ample bandwidth, rather than in large matrix-multiply units that would be underutilized due to the low reuse; (ii) a hardware-assisted transpose mechanism that accommodates the diverse memory access patterns observed in MANNs; (iii) a specialized processing tile equipped to handle the nearly equal mix of MAC and non-MAC computations present in MANNs; and (iv) methods for mapping MANNs to Manna that minimize data movement while fully exploiting what little reuse is present. We evaluate Manna by developing a detailed architectural simulator with timing and power models calibrated by synthesis to the 15 nm NanGate Open Cell Library. Across a suite of 10 benchmarks, Manna demonstrates average speedups of 39x with average energy improvements of 122x over an NVIDIA 1080-Ti (Pascal) GPU, and average speedups of 24x with average energy improvements of 86x over a state-of-the-art NVIDIA 2080-Ti (Turing) GPU.
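
To illustrate why soft memory accesses are so bandwidth-bound, the sketch below shows a content-based ("soft") read and write in the style of a Neural Turing Machine / Differentiable Neural Computer. It is an illustrative sketch rather than the paper's implementation: the cosine-similarity addressing, the softmax sharpening parameter beta, the erase/add write semantics, and the memory dimensions are assumptions based on standard MANN formulations. Every operation streams the entire N x W memory while performing only a handful of multiply-accumulates per element, which is the low FLOPs/byte behavior that Manna targets.

# Minimal sketch of a "soft" memory read and write in an NTM/DNC-style MANN.
# Illustrative only; memory shape, addressing, and write semantics are assumed.
import numpy as np

def soft_read(M, key, beta=1.0, eps=1e-8):
    # Content-based addressing: compare the key against EVERY row of memory.
    # For an N x W memory this streams all N*W values but performs only a
    # few MACs per value, i.e., very low arithmetic intensity.
    sim = (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + eps)
    w = np.exp(beta * sim)
    w /= w.sum()                      # softmax attention over all locations
    return w, w @ M                   # weighted sum of every memory row

def soft_write(M, w, erase, add):
    # A soft write also touches every location: each row is partially erased
    # and partially updated in proportion to its attention weight.
    return M * (1.0 - np.outer(w, erase)) + np.outer(w, add)

N, W = 128, 64                        # memory locations x word width (assumed)
M = np.random.randn(N, W)
key = np.random.randn(W)
w, r = soft_read(M, key)
M = soft_write(M, w, erase=np.full(W, 0.1), add=np.random.randn(W))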
