Deep Learning Compiler Optimization on Multi-Chiplet Architecture
[1] Xuyi Cai et al. Survey on chiplets: interface, interconnect and integration methodology, 2022, CCF Trans. High Perform. Comput.
[2] Joseph Gonzalez et al. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning, 2022, OSDI.
[3] Azalia Mirhoseini et al. A Transferable Approach for Partitioning Machine Learning Models on Multi-Chip-Modules, 2021, MLSys.
[4] Kaisheng Ma et al. NN-Baton: DNN Workload Orchestration and Chiplet Granularity Exploration for Multichip Accelerators, 2021, ISCA.
[5] James Demmel et al. CoSA: Scheduling by Constrained Optimization for Spatial Accelerators, 2021, ISCA.
[6] Zhiqiang Xie et al. Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks, 2020, OSDI.
[7] Mark Chen et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[8] Hui Yang et al. Chiplet Heterogeneous Integration Technology—Status and Challenges, 2020, Electronics.
[9] William J. Dally et al. Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture, 2019, MICRO.
[10] Christoforos E. Kozyrakis et al. TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators, 2019, ASPLOS.
[11] Xuehai Qian et al. HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, 2019, HPCA.
[12] Mingyu Gao et al. Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators, 2018, ASPLOS.
[13] Minjie Wang et al. Supporting Very Large Models using Automatic Dataflow Graph Partitioning, 2018, EuroSys.
[14] Lukasz Kaiser et al. Attention is All you Need, 2017, NIPS.
[15] Zhuowen Tu et al. Aggregated Residual Transformations for Deep Neural Networks, 2017, CVPR.
[16] Ali Farhadi et al. You Only Look Once: Unified, Real-Time Object Detection, 2016, CVPR.
[17] Jongman Kim et al. Virtualizing Virtual Channels for Increased Network-on-Chip Robustness and Upgradeability, 2012, ISVLSI.
[18] Li Shang et al. Dynamic voltage scaling with links for power optimization of interconnection networks, 2003, HPCA.
[19] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[20] G. Moore. Cramming More Components Onto Integrated Circuits (reprinted from Electronics, vol. 38, no. 8, April 19, 1965, pp. 114 ff.), 2006, IEEE Solid-State Circuits Newsletter.
[21] G. E. Moore et al. Cramming More Components Onto Integrated Circuits, 1998, Proceedings of the IEEE.