A Closer Look at Codistillation for Distributed Training
Michael G. Rabbat, Olivier Delalleau, Nicolas Ballas, Shagun Sodhani, Mahmoud Assran, Koustuv Sinha
[1] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[3] Philipp Koehn, et al. Proc. International Workshop on Spoken Language Translation, 2012.
[4] Alexander J. Smola, et al. Communication Efficient Distributed Machine Learning with the Parameter Server, 2014, NIPS.
[5] Geoffrey E. Hinton, et al. Distilling the Knowledge in a Neural Network, 2015, ArXiv.
[6] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[7] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.
[8] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Guigang Zhang, et al. Deep Learning, 2016, Int. J. Semantic Comput.
[10] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.
[11] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[12] Wei Zhang, et al. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent, 2017, NIPS.
[13] Dan Alistarh, et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks, 2016, arXiv:1610.02132.
[14] Luca Antiga, et al. Automatic differentiation in PyTorch, 2017.
[15] Dimitris S. Papailiopoulos, et al. Gradient Diversity: a Key Ingredient for Scalable Distributed Learning, 2017, AISTATS.
[16] Jianyu Wang, et al. Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms, 2018, ArXiv.
[17] Kamyar Azizzadenesheli, et al. signSGD: Compressed Optimisation for Non-Convex Problems, 2018, ICML.
[18] Yuanzhou Yang, et al. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes, 2018, ArXiv.
[19] Myle Ott, et al. Scaling Neural Machine Translation, 2018, WMT.
[20] Geoffrey E. Hinton, et al. Large Scale Distributed Neural Network Training through Online Distillation, 2018, ICLR.
[21] Huchuan Lu, et al. Deep Mutual Learning, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[22] Zhi Zhang, et al. Bag of Tricks for Image Classification with Convolutional Neural Networks, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Sebastian U. Stich, et al. Local SGD Converges Fast and Communicates Little, 2018, ICLR.
[24] M. Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, ArXiv.
[25] Michael G. Rabbat, et al. Stochastic Gradient Push for Distributed Deep Learning, 2018, ICML.
[26] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, ArXiv.
[27] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[28] Shenghuo Zhu, et al. Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning, 2018, AAAI.
[29] Myle Ott, et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling, 2019, NAACL.
[30] Peter Henderson, et al. Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning, 2020, ArXiv.
[31] Tomohide Shibata (柴田知秀). Understand It in 5 Minutes!? A Quick Read of Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2020.
[32] Martin Jaggi, et al. Decentralized Deep Learning with Arbitrary Communication Compression, 2019, ICLR.
[33] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[34] Tyler B. Johnson, et al. AdaScale SGD: A User-Friendly Algorithm for Distributed Training, 2020, ICML.
[35] Yuanzhi Li, et al. Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning, 2020, ICLR.
[36] Michael G. Rabbat, et al. SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum, 2019, ICLR.
[37] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, ArXiv.
[38] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.