Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
Joseph Gonzalez, I. Stoica, Lianmin Zheng, Yanping Huang, Danyang Zhuo, Zhifeng Chen, Yuanzhong Xu, Hao Zhang, Yida Wang, Zhuohan Li, Yonghao Zhuang