RNSnet: In-Memory Neural Network Acceleration Using Residue Number System

Technological advances continually create more data than we can readily process. Machine learning algorithms, in particular Deep Neural Networks (DNNs), are essential for processing such large volumes of data. Computing a DNN requires loading the trained network onto the processing elements and storing the results back in memory, so these applications demand high memory bandwidth. Traditional cores are limited by memory bandwidth; running DNNs on them therefore incurs high energy consumption and slow processing due to the large amount of data movement between memory and processing units. Several prior works address the data movement issue by enabling Processing In-Memory (PIM) using crossbar analog multiplication. However, these designs suffer from the large overhead of converting data between the analog and digital domains. In this work, we propose RNSnet, which uses the Residue Number System (RNS) to execute neural networks entirely in the digital domain, in memory. RNSnet simplifies the fundamental neural network operations and maps them to in-memory addition and data access. We evaluate the efficiency of the proposed design on several popular neural network applications. Our experimental results show that RNSnet consumes 145.5x less energy and achieves a 35.4x speedup compared to an NVIDIA GTX 1080 GPU. In addition, RNSnet improves the energy-delay product by 8.5x compared to state-of-the-art neural network accelerators.
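To make the RNS mapping concrete, the sketch below shows how a multiply-accumulate decomposes into small, independent per-modulus operations (realizable with lookup tables and modular adders rather than wide multipliers) and how a Chinese Remainder Theorem step reconstructs the binary result. This is a minimal illustration under our own assumptions, not RNSnet's actual design: the moduli set (7, 8, 9) and the function names are illustrative choices.

```python
# Minimal sketch of Residue Number System (RNS) arithmetic; the moduli set
# and function names are illustrative, not taken from RNSnet itself.
from math import prod

MODULI = (7, 8, 9)  # pairwise coprime, e.g. (2^k - 1, 2^k, 2^k + 1) with k = 3
M = prod(MODULI)    # dynamic range: values in [0, M) are uniquely representable

def to_rns(x):
    """Forward conversion: split a binary integer into small residues."""
    return tuple(x % m for m in MODULI)

def rns_mac(acc, a, b):
    """Multiply-accumulate, performed independently per modulus.
    Each channel operates on narrow digits with no carry propagation
    between channels, so it can be served by small lookup tables and
    modular additions instead of a wide multiplier."""
    return tuple((r + ra * rb) % m for r, ra, rb, m in zip(acc, a, b, MODULI))

def from_rns(residues):
    """Reverse conversion via the Chinese Remainder Theorem."""
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(., -1, m): modular inverse of Mi mod m
    return x % M

# Example: a tiny dot product, 3*4 + 5*6 = 42, computed entirely in RNS.
acc = to_rns(0)
for a, b in [(3, 4), (5, 6)]:
    acc = rns_mac(acc, to_rns(a), to_rns(b))
assert from_rns(acc) == 42
```

Because no carries cross residue channels, every channel stays narrow regardless of the overall dynamic range; this independence is what allows the dot products that dominate neural network inference to be mapped to in-memory addition and data access, as the abstract describes.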
