PartIR: Composing SPMD Partitioning Strategies for Machine Learning
Norman A. Rink, Adam Paszke, D. Maclaurin, Dimitrios Vytiniotis, Michael Schaarschmidt, Tamara Norman, Sami Alabed, Bart Chrzaszcz, Tom Natan, Juliana Franco, Dominik Grewe, James Molloy, Xiaoyue Pan, Timur Sitdikov, Agnieszka Swietlik, Joel Wee
[1] Andrew Zisserman, et al. TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement, 2023, ICCV.
[2] Lisa Anne Hendricks, et al. Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining, 2023, EMNLP.
[3] Lisa Anne Hendricks, et al. Measuring Progress in Fine-grained Vision-and-Language Understanding, 2023, ACL.
[4] Myle Ott, et al. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel, 2023, Proc. VLDB Endow.
[5] Peter C. Ma, et al. TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings, 2023, ISCA.
[6] Naman Goyal, et al. LLaMA: Open and Efficient Foundation Language Models, 2023, ArXiv.
[7] Blake A. Hechtman, et al. Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models, 2022, ASPLOS.
[8] Guillem Cucurull, et al. Galactica: A Large Language Model for Science, 2022, ArXiv.
[9] J. Dean, et al. Efficiently Scaling Transformer Inference, 2022, ArXiv.
[10] Norman A. Rink, et al. Automatic Discovery of Composite SPMD Partitioning Strategies in PartIR, 2022, ArXiv.
[11] Lawrence C. McAfee, et al. Reducing Activation Recomputation in Large Transformer Models, 2022, MLSys.
[12] A. Aiken, et al. DISTAL: the distributed tensor algebra compiler, 2022, PLDI.
[13] Norman A. Rink, et al. Automap: Towards Ergonomic Automated Parallelism for ML Models, 2021, ArXiv.
[14] Daniel D. Johnson, et al. Getting to the point: index sets and parallelism-preserving autodiff for pointful array programming, 2021, Proc. ACM Program. Lang.
[15] Peter C. Ma, et al. Ten Lessons From Three Generations Shaped Google’s TPUv4i: Industrial Product, 2021, ISCA.
[16] Noam Shazeer, et al. GSPMD: General and Scalable Parallelization for ML Computation Graphs, 2021, ArXiv.
[17] A. Fitzgibbon, et al. DistIR: An Intermediate Representation for Optimizing Distributed Neural Networks, 2021, EuroMLSys@EuroSys.
[18] Uday Bondhugula, et al. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation, 2021, CGO.
[19] Olatunji Ruwase, et al. ZeRO-Offload: Democratizing Billion-Scale Model Training, 2021, USENIX ATC.
[20] Rastislav Bodik, et al. Fireiron: A Data-Movement-Aware Scheduling Language for GPUs, 2020, PACT.
[21] Olatunji Ruwase, et al. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters, 2020, KDD.
[22] S. Gorlatch, et al. Achieving high-performance the functional way: a functional pearl on expressing high-performance optimizations as rewrite strategies, 2020, Proc. ACM Program. Lang.
[23] Jun Yang, et al. Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for DNN Workloads, 2020, ArXiv.
[24] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.
[25] Pieter Abbeel, et al. Denoising Diffusion Probabilistic Models, 2020, NeurIPS.
[26] D. Narayanan, et al. Memory-Efficient Pipeline-Parallel DNN Training, 2020, ICML.
[27] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[28] Dehao Chen, et al. Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training, 2020, ArXiv.
[29] Jure Leskovec, et al. Learning to Simulate Complex Physics with Graph Networks, 2020, ICML.
[30] Hans-Joachim Wittmann, et al. Force Fields, 2020, War at the Speed of Light.
[31] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.
[32] Nikhil R. Devanur, et al. PipeDream: generalized pipeline parallelism for DNN training, 2019, SOSP.
[33] Samyam Rajbhandari, et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2019, SC.
[34] M. Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, ArXiv.
[35] Travis M. Drucker, et al. High Resolution Medical Image Analysis with Spatial Partitioning, 2019, ArXiv.
[36] David Cox, et al. Triton: an intermediate language and compiler for tiled neural network computations, 2019, MAPL@PLDI.
[37] Alexandros Nikolaos Ziogas, et al. Stateful dataflow multigraphs: a data-centric model for performance portability on heterogeneous architectures, 2019, SC.
[38] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, NeurIPS.
[39] Dustin Tran, et al. Mesh-TensorFlow: Deep Learning for Supercomputers, 2018, NeurIPS.
[40] Matei Zaharia, et al. Optimizing data-intensive computations in existing libraries with split annotations, 2018, SOSP.
[41] Alexander Aiken, et al. Beyond Data and Model Parallelism for Deep Neural Networks, 2018, SysML.
[42] Eddie Q. Yan, et al. TVM: End-to-End Optimization Stack for Deep Learning, 2018, ArXiv.
[43] Shoaib Kamil, et al. The tensor algebra compiler, 2017, Proc. ACM Program. Lang.
[44] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[45] David A. Patterson, et al. In-datacenter performance analysis of a tensor processing unit, 2017, ISCA.
[46] Michel Steuwer, et al. LIFT: A functional data-parallel IR for high-performance GPU code generation, 2017, CGO.
[47] Andy Davis, et al. TensorFlow: A System for Large-Scale Machine Learning, 2016, OSDI.
[48] Sam Lindley, et al. Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code, 2015, ICFP.
[49] Frédo Durand, et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, 2013, PLDI.
[50] Eelco Visser, et al. Building program optimizers with rewriting strategies, 1998, ICFP.
[51] Alvaro Sanchez-Gonzalez, et al. Simple GNN Regularisation for 3D Molecular Property Prediction and Beyond, 2022, ICLR.
[52] Amar Phanishayee, et al. Efficient Large-Scale Language Model Training on GPU Clusters, 2021, ArXiv.