Efficient AI System Design With Cross-Layer Approximate Computing

Advances in deep neural networks (DNNs) and the availability of massive real-world data have enabled superhuman accuracy on many AI tasks and ushered in the explosive growth of AI workloads across the spectrum of computing devices. However, this superior accuracy comes at a high computational cost, which necessitates approaches beyond traditional computing paradigms to improve operational efficiency. Leveraging the application-level insight that DNNs are inherently resilient to errors, we demonstrate how approximate computing (AxC) can significantly boost the efficiency of AI platforms and play a pivotal role in the broader adoption of AI-based applications and services. To this end, we present RaPiD, a multi-tera-operations-per-second (TOPS) AI hardware accelerator core, fabricated in 14-nm technology, that we built from the ground up using AxC techniques across the stack, including algorithms, architecture, programmability, and hardware. We highlight the workload-guided, systematic exploration of AxC techniques for AI employed in the RaPiD accelerator, including custom number representations, quantization/pruning methodologies, mixed-precision architecture design, instruction sets, and compiler technologies with quality programmability.
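As a concrete illustration of one of the quantization methodologies named above, the sketch below shows generic symmetric, per-tensor uniform quantization of DNN weights in NumPy. It is a minimal example for exposition only: the function names, the 4-bit setting, and the per-tensor scaling choice are illustrative assumptions and do not describe the RaPiD accelerator's actual quantization flow.

# Minimal sketch of symmetric per-tensor uniform quantization, a generic
# illustration of low-precision inference techniques (not RaPiD's method).
import numpy as np

def quantize_symmetric(weights: np.ndarray, num_bits: int = 4):
    """Map float weights to signed integers in [-(2^(b-1)-1), 2^(b-1)-1]."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g., 7 for 4-bit weights
    scale = np.max(np.abs(weights)) / qmax    # per-tensor scale factor
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor for accuracy evaluation."""
    return q.astype(np.float32) * scale

# Example: quantize a random weight tensor and measure the induced error.
w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_symmetric(w, num_bits=4)
w_hat = dequantize(q, s)
print("mean abs quantization error:", np.mean(np.abs(w - w_hat)))

In practice, such low-precision schemes are paired with quantization-aware training or retraining to recover accuracy, which is the trade-off space the cross-layer exploration in the article addresses.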
