Ammar Ahmad Awan | Hanlin Tang | Samyam Rajbhandari | Conglong Li | Yuxiong He
[1] Martin Jaggi, et al. PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning, 2020, NeurIPS.
[2] Minjia Zhang, et al. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping, 2020, NeurIPS.
[3] Dan Alistarh, et al. ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning, 2017, ICML.
[4] Yongjian Wu, et al. Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters, 2020, MLSys.
[5] Martin Jaggi, et al. Decentralized Deep Learning with Arbitrary Communication Compression, 2019, ICLR.
[6] Kamyar Azizzadenesheli, et al. signSGD with Majority Vote is Communication Efficient and Fault Tolerant, 2018, ICLR.
[7] Peng Jiang, et al. A Linear Speedup Analysis of Distributed Deep Learning with Sparse and Quantized Communication, 2018, NeurIPS.
[8] Cong Xu, et al. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, 2017, NIPS.
[9] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, ArXiv.
[10] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.
[11] Eduard A. Gorbunov, et al. Linearly Converging Error Compensated SGD, 2020, NeurIPS.
[12] Nam Sung Kim, et al. Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training, 2018, NeurIPS.
[13] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.
[14] Georgios B. Giannakis, et al. Communication-Efficient Distributed Learning via Lazily Aggregated Quantized Gradients, 2019, NeurIPS.
[15] Suhas Diggavi, et al. Qsparse-Local-SGD: Distributed SGD With Quantization, Sparsification, and Local Computations, 2019, IEEE Journal on Selected Areas in Information Theory.
[16] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[17] Shaohuai Shi, et al. A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks, 2019, IEEE ICDCS.
[18] Christopher Ré, et al. Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care, 2015, NIPS.
[19] Indranil Gupta, et al. CSER: Communication-efficient SGD with Error Reset, 2020, NeurIPS.
[20] Dan Alistarh, et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks, 2016, ArXiv.
[21] Ji Liu, et al. Gradient Sparsification for Communication-Efficient Distributed Optimization, 2017, NeurIPS.
[22] Vladimir Braverman, et al. Communication-efficient distributed SGD with Sketching, 2019, NeurIPS.
[23] Anastasios Kyrillidis, et al. Compressing Gradient Optimizers via Count-Sketches, 2019, ICML.
[24] Min Ye, et al. Communication-Computation Efficient Gradient Coding, 2018, ICML.
[25] James T. Kwok, et al. Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback, 2019, NeurIPS.
[26] Martin Jaggi, et al. PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization, 2019, NeurIPS.
[27] Martin Jaggi, et al. Sparsified SGD with Memory, 2018, NeurIPS.
[28] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[29] Sanjiv Kumar, et al. cpSGD: Communication-efficient and differentially-private distributed SGD, 2018, NeurIPS.
[30] Wei Zhang, et al. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent, 2017, NIPS.
[31] Le Trieu Phong, et al. Distributed SGD With Flexible Gradient Compression, 2020, IEEE Access.
[32] Dong Yu, et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs, 2014, INTERSPEECH.
[33] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[34] Nenghai Yu, et al. Asynchronous Stochastic Gradient Descent with Delay Compensation, 2016, ICML.
[35] Liwei Wang, et al. On Layer Normalization in the Transformer Architecture, 2020, ICML.
[36] Longbo Huang, et al. Double Quantization for Communication-Efficient Distributed Optimization, 2018, NeurIPS.
[37] Jacob Devlin, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[38] Xiangru Lian, et al. 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed, 2021, ICML.
[39] Aryan Mokhtari, et al. Towards More Efficient Stochastic Decentralized Learning: Faster Convergence and Sparse Communication, 2018, ICML.