Deep Learning Compiler Optimization on Multi-Chiplet Architecture
[1] Xuyi Cai et al. Survey on chiplets: interface, interconnect and integration methodology, 2022, CCF Trans. High Perform. Comput.
[2] Joseph Gonzalez et al. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning, 2022, OSDI.
[3] Azalia Mirhoseini et al. A Transferable Approach for Partitioning Machine Learning Models on Multi-Chip-Modules, 2021, MLSys.
[4] Kaisheng Ma et al. NN-Baton: DNN Workload Orchestration and Chiplet Granularity Exploration for Multichip Accelerators, 2021, ISCA.
[5] James Demmel et al. CoSA: Scheduling by Constrained Optimization for Spatial Accelerators, 2021, ISCA.
[6] Zhiqiang Xie et al. Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks, 2020, OSDI.
[7] Mark Chen et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[8] Hui Yang et al. Chiplet Heterogeneous Integration Technology—Status and Challenges, 2020, Electronics.
[9] William J. Dally et al. Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture, 2019, MICRO.
[10] Christoforos E. Kozyrakis et al. TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators, 2019, ASPLOS.
[11] Xuehai Qian et al. HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, 2019, HPCA.
[12] Mingyu Gao et al. Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators, 2018, ASPLOS.
[13] Minjie Wang et al. Supporting Very Large Models using Automatic Dataflow Graph Partitioning, 2018, EuroSys.
[14] Lukasz Kaiser et al. Attention is All you Need, 2017, NIPS.
[15] Zhuowen Tu et al. Aggregated Residual Transformations for Deep Neural Networks, 2017, CVPR.
[16] Ali Farhadi et al. You Only Look Once: Unified, Real-Time Object Detection, 2016, CVPR.
[17] Jongman Kim et al. Virtualizing Virtual Channels for Increased Network-on-Chip Robustness and Upgradeability, 2012, ISVLSI.
[18] Li Shang et al. Dynamic voltage scaling with links for power optimization of interconnection networks, 2003, HPCA.
[19] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[20] G. Moore. Cramming More Components Onto Integrated Circuits (reprinted from Electronics, vol. 38, no. 8, April 19, 1965, pp. 114 ff.), 2006, IEEE Solid-State Circuits Newsletter.
[21] G. E. Moore et al. Cramming More Components Onto Integrated Circuits, 1998, Proceedings of the IEEE.