A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim

The compute demand of deep learning workloads is well known and is a prime motivator for powerful parallel computing platforms such as GPUs and dedicated hardware accelerators. The massive inherent parallelism of these workloads makes it possible to extract more performance simply by provisioning more compute hardware for a given task. One way to exploit this is to build higher-performing hardware for DNN workloads by incorporating as many parallel compute units as possible in a single system, a strategy referred to as scaling up. Alternatively, multiple hardware systems can be arranged to work on a single problem, which in some cases is a cheaper way to exploit the available parallelism; this is referred to as scaling out. As DNN-based solutions become increasingly prevalent, so does the demand for computation, making the scaling choice (scale-up vs. scale-out) critical. To study this design space, this work makes two major contributions. (i) We describe SCALE-Sim, a cycle-accurate simulator for DNN inference on systolic arrays, which we use to model both scale-up and scale-out systems, capturing on-chip memory accesses, runtime, and DRAM bandwidth requirements for a given workload. (ii) We present an analytical model to estimate the optimal scale-up vs. scale-out ratio for a given workload under hardware constraints (e.g., TOPS and DRAM bandwidth). We observe that a judicious choice of scaling can lead to performance improvements as high as 50x per layer, within the available DRAM bandwidth. This work demonstrates and analyzes the trade-off space for performance, DRAM bandwidth, and energy, and identifies sweet spots for various workloads and hardware configurations.
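As a rough illustration of the scale-up vs. scale-out trade-off described above, the following minimal Python sketch applies a simple roofline-style max(compute, memory) bound to one hypothetical layer. It is not the paper's analytical model and not SCALE-Sim itself; the layer size, utilization, traffic, TOPS, and bandwidth numbers are illustrative assumptions chosen only to show how the comparison can flip depending on mapping efficiency and available DRAM bandwidth.

# Illustrative sketch (assumed numbers, not results from the paper or SCALE-Sim):
# first-order comparison of one large systolic array (scale-up) against
# several smaller arrays with the same total peak compute (scale-out).

def runtime_bound(macs, dram_bytes, peak_tops, utilization, dram_gbps):
    """Roofline-style bound: the slower of compute time and DRAM-transfer time."""
    compute_s = macs / (peak_tops * 1e12 * utilization)  # compute-bound time
    memory_s = dram_bytes / (dram_gbps * 1e9)            # bandwidth-bound time
    return max(compute_s, memory_s)

macs, traffic = 2e9, 20e6   # hypothetical layer: 2 GMACs, 20 MB off-chip traffic

# Scale-up: one 128x128 array; assume the layer maps poorly (50% utilization).
t_up = runtime_bound(macs, traffic, peak_tops=8.0, utilization=0.5, dram_gbps=64.0)

# Scale-out: four 64x64 arrays (same peak TOPS); assume better per-array mapping
# (90% utilization) but ~20% extra DRAM traffic from duplicated operands.
t_out = runtime_bound(macs, traffic * 1.2, peak_tops=8.0, utilization=0.9, dram_gbps=64.0)

print(f"scale-up bound:  {t_up * 1e6:.1f} us")   # compute-bound in this example
print(f"scale-out bound: {t_out * 1e6:.1f} us")  # faster here, but now memory-bound

Under these assumed numbers the scale-out configuration finishes the layer sooner but becomes DRAM-bandwidth-bound, which is the kind of sweet-spot analysis the paper performs systematically across workloads and hardware configurations.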
