Deep Learning Compiler Optimization on Multi-Chiplet Architecture

Multi-chiplet architectures can provide a high-performance platform for emerging workloads such as deep learning models. To fully utilize the chiplets and accelerate the execution of deep learning models, we present a deep learning compilation optimization framework for chiplet-based accelerators and propose a scheduling method based on data dependence. Experiments show that our method improves compilation efficiency and that the resulting scheduling scheme achieves at least 1-2x higher performance than traditional algorithms. A minimal sketch of dependence-based scheduling is given below.
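The following sketch illustrates the general idea of scheduling a deep learning operator graph onto multiple chiplets while respecting data dependences. It is not the paper's algorithm: the operator graph, the cost values, the uniform inter-chiplet communication latency, and the greedy earliest-finish-time placement policy are all illustrative assumptions.

```python
# Hypothetical sketch: dependence-aware list scheduling of a DNN operator
# graph onto chiplets. Costs, graph, and policy are illustrative only.
from collections import defaultdict


def topo_order(ops, deps):
    """Return operators in an order that respects data dependences."""
    indeg = {op: 0 for op in ops}
    succs = defaultdict(list)
    for dst, srcs in deps.items():
        indeg[dst] += len(srcs)
        for src in srcs:
            succs[src].append(dst)
    ready = [op for op in ops if indeg[op] == 0]
    order = []
    while ready:
        op = ready.pop()
        order.append(op)
        for nxt in succs[op]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    return order


def schedule(ops, deps, compute_cost, n_chiplets, comm_cost=1.0):
    """Greedily place each operator on the chiplet where it finishes
    earliest, adding comm_cost whenever a dependence crosses a chiplet
    boundary (assumed uniform link latency)."""
    chiplet_free = [0.0] * n_chiplets   # time each chiplet becomes idle
    finish = {}                         # op -> finish time
    placement = {}                      # op -> chiplet id
    for op in topo_order(ops, deps):
        best = None
        for c in range(n_chiplets):
            # op can start once chiplet c is idle and its inputs have arrived
            ready_t = chiplet_free[c]
            for src in deps.get(op, []):
                arrive = finish[src] + (comm_cost if placement[src] != c else 0.0)
                ready_t = max(ready_t, arrive)
            end_t = ready_t + compute_cost[op]
            if best is None or end_t < best[0]:
                best = (end_t, c)
        finish[op], placement[op] = best
        chiplet_free[best[1]] = best[0]
    return placement, max(finish.values())


if __name__ == "__main__":
    # Tiny hypothetical operator graph: conv1 -> {conv2a, conv2b} -> add
    ops = ["conv1", "conv2a", "conv2b", "add"]
    deps = {"conv2a": ["conv1"], "conv2b": ["conv1"], "add": ["conv2a", "conv2b"]}
    cost = {"conv1": 4.0, "conv2a": 3.0, "conv2b": 3.0, "add": 1.0}
    placement, makespan = schedule(ops, deps, cost, n_chiplets=2)
    print(placement, makespan)
```

In this toy setup the two parallel convolutions land on different chiplets only when the saved compute time outweighs the modeled communication cost, which is the basic trade-off any dependence-based chiplet scheduler must balance.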
