PROMISE: An End-to-End Design of a Programmable Mixed-Signal Accelerator for Machine-Learning Algorithms

Analog/mixed-signal machine learning (ML) accelerators exploit the unique computing capability of analog/mixed-signal circuits and the inherent error tolerance of ML algorithms to achieve higher energy efficiency than digital ML accelerators. Unfortunately, these analog/mixed-signal ML accelerators lack programmability, and even an instruction set interface, to support diverse ML algorithms or to enable essential software control over the energy-vs-accuracy tradeoff. We propose PROMISE, the first end-to-end design of a PROgrammable MIxed-Signal accElerator, from the Instruction Set Architecture (ISA) to a high-level-language compiler, for accelerating diverse ML algorithms. First, we identify the operations prevalent in widely used ML algorithms and the key constraints on supporting these operations in a programmable mixed-signal accelerator. Second, based on that analysis, we propose an ISA and the PROMISE architecture, built from silicon-validated components for mixed-signal operations. Third, we develop a compiler that takes an ML algorithm described in a high-level programming language (Julia) and generates PROMISE code, using an intermediate representation (IR) that is language-neutral and abstracts away unnecessary hardware details. Fourth, we show how the compiler can map an application-level error-tolerance specification for neural network applications down to low-level hardware parameters (swing voltages for each application Task) to minimize energy consumption. Our experiments show that PROMISE can accelerate diverse ML algorithms with energy efficiency competitive even with fixed-function digital ASICs built for specific ML algorithms, and that the compiler optimization achieves significant additional energy savings at the cost of as little as 1% additional error.
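To make the compilation flow concrete, the sketch below shows the kind of plain Julia kernel such a compiler could consume: a dot-product-based SVM decision function, representative of the vector operations that dominate the targeted ML algorithms. This is a minimal illustrative sketch, not the paper's actual API; the function name and signature are hypothetical, and the error-tolerance handling is described only in comments.

    # Hypothetical source kernel for illustration: a compiler like the one
    # described above would lower code of this shape, via a language-neutral
    # IR, to mixed-signal accelerator instructions.
    function svm_classify(w::Vector{Float64}, x::Vector{Float64}, b::Float64)
        s = 0.0
        for i in eachindex(w)      # vector dot product: the prevalent ML operation
            s += w[i] * x[i]       # one multiply-accumulate per element
        end
        return sign(s + b)         # threshold decision (+1 / -1 / 0)
    end

    # Example use. An application-level error tolerance (e.g., "1% extra
    # misclassifications allowed") would be supplied to the compiler
    # separately (an assumption here), which maps it to per-Task swing
    # voltages to minimize energy.
    w = randn(128); x = randn(128)
    println(svm_classify(w, x, 0.0))

The dot-product loop is the part a mixed-signal substrate executes approximately; the surrounding control flow stays exact, which is why an IR that separates the two is useful.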
