NN-Baton: DNN Workload Orchestration and Chiplet Granularity Exploration for Multichip Accelerators

The revolution of machine learning poses an unprecedented demand for computation resources, pushing for ever more transistors on a single monolithic chip, which is not sustainable in the post-Moore era. Multichip integration of small functional dies, called chiplets, can reduce manufacturing cost, improve fabrication yield, and enable die-level reuse across different system scales. DNN workload mapping and hardware design space exploration for such multichip systems are critical, yet largely missing today. This work provides a hierarchical, analytical framework to describe DNN mapping on a multichip accelerator and to analyze the communication overhead. Based on this framework, we propose an automatic tool called NN-Baton with a pre-design flow and a post-design flow. The pre-design flow guides chiplet granularity exploration under given area and performance budgets for the target workload. The post-design flow orchestrates the workload across the computation levels of the hierarchy: package, chiplet, and core. Compared to Simba, NN-Baton generates mapping strategies that save 22.5% to 44% of energy under the same computation and memory configurations. The architecture exploration demonstrates that area is a decisive factor for chiplet granularity. For a 2048-MAC system under a 2 mm² chiplet area constraint, the 4-chiplet implementation with 4 cores and 16 lanes of 8-wide vector-MAC units is consistently the top-pick computation allocation across several benchmarks. In contrast, the optimal memory allocation policy in the hierarchy typically depends on the neural network model.
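To make the package/chiplet/core hierarchy concrete, below is a minimal Python sketch of how a convolution layer might be partitioned across those three levels, with a toy energy and latency estimate in the spirit of the communication-overhead analysis the abstract describes. The abstract does not give NN-Baton's actual mapping descriptors or cost model, so every class, parameter, and energy constant here is a hypothetical illustration, not the paper's method.

```python
# Illustrative sketch only: hierarchical partitioning of a conv layer across
# chiplets and cores, plus a toy cost model. All names and constants are
# hypothetical; NN-Baton's real framework is not specified in the abstract.
from dataclasses import dataclass
from math import ceil

@dataclass
class ConvLayer:
    out_channels: int  # K
    in_channels: int   # C
    out_h: int
    out_w: int
    kernel: int        # R = S, square kernel assumed

@dataclass
class Mapping:
    chiplets: int  # output channels split across chiplets (assumed policy)
    cores: int     # output rows split across cores within a chiplet
    lanes: int     # vector-MAC lanes per core

def macs_per_core(layer: ConvLayer, m: Mapping) -> int:
    """Upper bound on MACs one core executes under this partition."""
    k_per_chiplet = ceil(layer.out_channels / m.chiplets)
    rows_per_core = ceil(layer.out_h / m.cores)
    return (k_per_chiplet * rows_per_core * layer.out_w
            * layer.in_channels * layer.kernel ** 2)

def toy_latency_cycles(layer: ConvLayer, m: Mapping, vec_width: int = 8) -> int:
    """Cycles assuming each core retires lanes * vec_width MACs per cycle."""
    return ceil(macs_per_core(layer, m) / (m.lanes * vec_width))

def toy_energy(layer: ConvLayer, m: Mapping,
               e_mac: float = 1.0, e_d2d: float = 10.0) -> float:
    """Compute energy plus die-to-die traffic for broadcasting the input
    tile to every chiplet. The 10x D2D-to-MAC ratio is a made-up constant."""
    total_macs = (layer.out_channels * layer.out_h * layer.out_w
                  * layer.in_channels * layer.kernel ** 2)
    inputs = layer.in_channels * layer.out_h * layer.out_w
    d2d = inputs * (m.chiplets - 1) * e_d2d  # re-sent to each extra chiplet
    return total_macs * e_mac + d2d

layer = ConvLayer(out_channels=256, in_channels=128, out_h=14, out_w=14, kernel=3)
for chiplets in (1, 2, 4, 8):
    m = Mapping(chiplets=chiplets, cores=4, lanes=16)
    print(chiplets, macs_per_core(layer, m),
          toy_latency_cycles(layer, m), toy_energy(layer, m))
```

A pre-design flow in this spirit would sweep the `Mapping` parameters under an area budget and keep the Pareto-optimal points; a post-design flow would fix the hardware and search only over the per-layer partitioning.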
