A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

Implementing embedded neural network processing at the edge requires efficient hardware acceleration that combines high computational throughput with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly adapted to support these new functionalities. Hardware designers can refer to a myriad of accelerator implementations in the literature to evaluate and compare hardware design choices. However, the sheer number of publications and their diverse optimization directions hinder an effective assessment. Existing surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effects of each optimization technique. This complicates the evaluation of optimizations for new accelerator designs and slows down research progress. In contrast to previous surveys, this work provides a quantitative overview of the neural network accelerator optimization approaches used in recent works and reports their individual effects on edge processing performance. The list of optimizations and their quantitative effects is presented as a construction kit, allowing designers to assess the design choices for each building block individually. Reported optimizations range from up to 10,000× memory savings to 33× energy reductions, giving chip designers an overview of the design choices available for implementing efficient low power neural network accelerators.
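To make the construction-kit idea concrete, the sketch below models each optimization as a building block with a reported effect and composes the blocks multiplicatively. This is an illustration only: the 10,000× memory and 33× energy figures above are the bounds the abstract reports across all surveyed works, while the block names, per-block numbers, and the naive multiplicative composition here are hypothetical assumptions, not values or a method from the paper.

```python
# Illustrative sketch of the "construction kit" concept (not from the paper):
# each building block carries a quantitative effect, assessed individually.
# All names and numbers below are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class BuildingBlock:
    name: str
    memory_saving: float   # multiplicative reduction in weight/activation storage
    energy_saving: float   # multiplicative reduction in energy per inference

# Hypothetical per-block effects, chosen only to show the composition idea.
kit = [
    BuildingBlock("quantization (8b -> 1b weights)", memory_saving=8.0, energy_saving=4.0),
    BuildingBlock("pruning + sparse encoding",       memory_saving=10.0, energy_saving=3.0),
    BuildingBlock("dataflow / scheduling",           memory_saving=1.5,  energy_saving=2.0),
]

def combined(blocks: list[BuildingBlock], attr: str) -> float:
    """Naive multiplicative composition; real effects interact and
    rarely multiply cleanly, so treat the result as an upper estimate."""
    total = 1.0
    for block in blocks:
        total *= getattr(block, attr)
    return total

if __name__ == "__main__":
    print(f"memory: {combined(kit, 'memory_saving'):.0f}x")  # 120x under these assumptions
    print(f"energy: {combined(kit, 'energy_saving'):.0f}x")  # 24x under these assumptions
```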
