DAISM: Digital Approximate In-SRAM Multiplier-based Accelerator for DNN Training and Inference

Deep Neural Networks (DNNs) are among the most widely used deep learning models. Their matrix multiplication operations incur significant computational cost and are bottlenecked by data movement between memory and the processing elements. Many specialized accelerators have been proposed to optimize these operations. One popular approach is Processing-in-Memory (PIM), where computations are performed by the memory storage elements themselves, reducing the overhead of moving data between processor and memory. However, most PIM solutions rely either on novel memory technologies that have yet to mature or on bit-serial computation, which carries significant performance overhead and scalability issues. In this work, we propose an in-SRAM digital multiplier that takes the best of both worlds: it performs general matrix multiplication (GEMM) in memory using only conventional SRAM, without the drawbacks of bit-serial computation. This allows designers to build systems with significant performance gains using existing technology with little to no modification. We first design a novel approximate bit-parallel multiplier that approximates multiplication with bitwise OR operations by leveraging multi-wordline activation in the SRAM. We then propose DAISM, a Digital Approximate In-SRAM Multiplier architecture, an accelerator for convolutional neural networks built on this multiplier, followed by a comprehensive analysis of the trade-offs in area, accuracy, and performance. We show that, under similar design constraints, DAISM reduces energy consumption by 25% and cycle count by 43% compared to state-of-the-art baselines.
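To make the core idea concrete: activating multiple SRAM wordlines simultaneously yields (in effect) the bitwise OR of the stored rows, so an approximate multiplier can replace the carry-propagating addition of shifted partial products with a single OR. The following is a minimal Python sketch of that substitution; the function name `approx_multiply` and the exact partial-product handling are illustrative assumptions, not the paper's precise circuit-level scheme.

```python
def approx_multiply(a: int, b: int, bits: int = 8) -> int:
    """Approximate a * b by OR-ing shifted partial products instead of adding.

    This models the effect of multi-wordline activation in SRAM, where reading
    several rows at once produces their bitwise OR. Illustrative sketch only;
    the paper's actual partial-product encoding may differ.
    """
    acc = 0
    for i in range(bits):
        if (b >> i) & 1:          # select the partial product for bit i of b
            acc |= a << i         # OR in place of ADD: carries are dropped
    return acc


if __name__ == "__main__":
    # Compare exact and approximate products for a few operand pairs.
    for a, b in [(3, 5), (12, 10), (200, 100)]:
        print(f"{a} * {b}: exact = {a * b}, approx = {approx_multiply(a, b)}")
```

Because OR never generates carries, the approximate product is exact whenever the shifted partial products have disjoint bit patterns (e.g. 12 × 10) and underestimates the true product when they overlap, which is the accuracy trade-off the paper analyzes.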
