Doing more with less: training large DNN models on commodity servers for the masses
Amar Phanishayee | Nam Sung Kim | Derek Murray | Youjie Li