Deep In-Memory Architectures for Machine Learning–Accuracy Versus Efficiency Trade-Offs

In-memory architectures, in particular the deep in-memory architecture (DIMA), have emerged as an attractive alternative to the traditional von Neumann (digital) architecture for realizing energy- and latency-efficient machine learning systems in silicon. Multiple DIMA integrated circuit (IC) prototypes have demonstrated energy-delay product (EDP) gains of up to $100\times$ over a digital architecture. These EDP gains were achieved with minimal, or sometimes no, loss in decision-making accuracy, which is surprising given DIMA's intrinsically analog mixed-signal nature. This paper establishes models and methods to understand the fundamental energy-delay and accuracy trade-offs underlying DIMA by: 1) presenting silicon-validated energy, delay, and accuracy models; and 2) employing these models to quantify DIMA's decision-level accuracy and to identify the design parameters that are most effective at maximizing its EDP gains at a given level of accuracy. For example, it is shown that: 1) DIMA has the potential to realize EDP gains of $21\times$ to $1365\times$; 2) its energy per decision is approximately $10\times$ lower at the same decision-making accuracy under most conditions; 3) its accuracy can always be improved by increasing the input vector dimension and/or the bitline swing; and 4) unlike the digital architecture, there are quantifiable conditions under which DIMA's accuracy is fundamentally limited by noise.
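Because the headline results above are expressed as energy-delay product gains, a small worked example helps make the metric concrete. The Python sketch below uses entirely hypothetical per-decision energy and delay figures (not values from this paper) and simply evaluates EDP = energy × delay for a digital baseline and a DIMA-style design, then reports the resulting gain.

# Illustrative sketch only: the energy/delay numbers are hypothetical
# placeholders, not measurements from the paper. EDP is energy x delay,
# and the gain is the ratio of the baseline EDP to the DIMA EDP.

def edp(energy_nj: float, delay_ns: float) -> float:
    """Energy-delay product in nJ*ns."""
    return energy_nj * delay_ns

# Hypothetical per-decision figures for a digital baseline and a DIMA design.
digital = {"energy_nj": 200.0, "delay_ns": 500.0}
dima    = {"energy_nj":  20.0, "delay_ns": 100.0}

edp_gain = edp(**digital) / edp(**dima)
energy_gain = digital["energy_nj"] / dima["energy_nj"]

print(f"EDP gain:    {edp_gain:.0f}x")     # 50x for these made-up numbers
print(f"Energy gain: {energy_gain:.0f}x")  # 10x

Note how the two factors compound: a 10x energy reduction combined with a 5x delay reduction yields a 50x EDP gain, which is how EDP gains in the 21x-to-1365x range can arise from more modest improvements along each individual axis.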
