uGEMM: Unary Computing Architecture for GEMM Applications

General matrix multiplication (GEMM) is ubiquitous in applications such as signal processing, machine learning, and computer vision. Conventional GEMM hardware architectures based on binary computing scale poorly in area and energy efficiency because of the spatial nature of binary number representation and computing. Unary computing, on the other hand, can be performed with extremely simple processing units, often a single logic gate. However, no efficient architectures for unary GEMM currently exist. In this paper, we present uGEMM, an area- and energy-efficient unary GEMM architecture enabled by novel arithmetic units. The proposed design relaxes two constraints previously imposed on input bit streams, namely low correlation and long stream length, and achieves superior area and energy efficiency over existing unary systems. Furthermore, uGEMM's output bit streams exhibit higher accuracy and faster convergence, enabling dynamic energy-accuracy scaling on resource-constrained systems.
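The single-gate claim and the two stream constraints (low correlation, long stream length) can be illustrated with conventional unary multiplication. The following is a minimal Python sketch, not uGEMM's arithmetic units. It assumes that a stream's value is the fraction of cycles in which the bit is 1, uses a bitwise AND as the multiplier, and uses a clock-division-style repetition as one known deterministic decorrelation technique. All function names are hypothetical and chosen for illustration.

```python
import random


def thermometer(k, n):
    """Deterministic unary (thermometer) code for k/n: k ones, then n - k zeros."""
    return [1] * k + [0] * (n - k)


def bernoulli_stream(p, length, rng):
    """Stochastic (rate-coded) stream: each bit is 1 independently with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def and_gate(a, b):
    """Bitwise AND of two equal-length streams, used here as a single-gate multiplier."""
    return [x & y for x, y in zip(a, b)]


def decode(stream):
    """Value of a unary stream: the fraction of cycles in which the bit is 1."""
    return sum(stream) / len(stream)


def clock_divide(a, b):
    """Decorrelate two deterministic streams (illustrative technique, assumed here):
    repeat `a` once per bit of `b` and hold each bit of `b` for len(a) cycles.
    Both outputs have length len(a) * len(b)."""
    a_rep = a * len(b)
    b_hold = [bit for bit in b for _ in range(len(a))]
    return a_rep, b_hold


if __name__ == "__main__":
    a = thermometer(6, 8)  # encodes 0.75
    b = thermometer(4, 8)  # encodes 0.5

    # Aligned (maximally correlated) streams: AND yields min(0.75, 0.5), not the product.
    print(decode(and_gate(a, b)))  # 0.5

    # Decorrelated deterministic streams: exact product, but 64 cycles instead of 8.
    a2, b2 = clock_divide(a, b)
    print(decode(and_gate(a2, b2)), len(a2))  # 0.375 64

    # Independent random streams: the product is only approximated; the error
    # shrinks roughly as 1/sqrt(length), so accuracy requires long streams.
    rng = random.Random(0)
    x = bernoulli_stream(0.75, 4096, rng)
    y = bernoulli_stream(0.5, 4096, rng)
    print(decode(and_gate(x, y)))  # close to 0.375
```

The three cases show why earlier unary designs needed uncorrelated inputs. Correlated inputs turn AND into a minimum. Decorrelation, whether random or deterministic, makes the required stream length grow, which in turn raises latency and energy.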
