Dimitris S. Papailiopoulos | Shivaram Venkataraman | Kangwook Lee | Hongyi Wang | Saurabh Agarwal
[1] Ruixuan Li, et al. Accelerating Distributed Deep Learning By Adaptive Gradient Quantization, 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[2] Jascha Sohl-Dickstein, et al. Measuring the Effects of Data Parallelism on Neural Network Training, 2018, J. Mach. Learn. Res.
[3] Jianyu Wang, et al. Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD, 2018, MLSys.
[4] Sebastian U. Stich, et al. Local SGD Converges Fast and Communicates Little, 2018, ICLR.
[5] Kamyar Azizzadenesheli, et al. signSGD: compressed optimisation for non-convex problems, 2018, ICML.
[6] David Patterson, et al. MLPerf Training Benchmark, 2019, MLSys.
[7] Tao Lin, et al. Don't Use Large Mini-Batches, Use Local SGD, 2018, ICLR.
[8] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.
[9] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[10] Michael Garland, et al. AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks, 2017, ArXiv.
[11] Luis Ceze, et al. PLink: Discovering and Exploiting Locality for Accelerated Distributed Training on the Public Cloud, 2020, MLSys.
[12] Elad Hoffer, et al. Train longer, generalize better: closing the generalization gap in large batch training of neural networks, 2017, NIPS.
[13] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.
[14] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.
[15] Shaohuai Shi, et al. Understanding Top-k Sparsification in Distributed Deep Learning, 2019, ArXiv.
[16] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[17] Cong Xu, et al. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, 2017, NIPS.
[18] Dan Alistarh, et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks, 2016, arXiv:1610.02132.
[19] Jorge Nocedal, et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.
[20] Alexander Sergeev, et al. Horovod: fast and easy distributed deep learning in TensorFlow, 2018, ArXiv.
[21] Wei Zhang, et al. AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training, 2017, AAAI.
[22] Kurt Keutzer, et al. Hessian-based Analysis of Large Batch Training and Robustness to Adversaries, 2018, NeurIPS.
[23] Martin Jaggi, et al. PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization, 2019, NeurIPS.
[24] Enhua Wu, et al. Squeeze-and-Excitation Networks, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[25] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.
[26] Joseph Gonzalez, et al. On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent, 2018, ArXiv.
[27] Yoshua Bengio, et al. On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length, 2018, ICLR.
[28] Sebastian U. Stich, et al. The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication, 2019, ArXiv.
[29] Hao Liang, et al. Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD, 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[30] Shaohuai Shi, et al. A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks, 2019, 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS).
[31] Kenneth Heafield, et al. Sparse Communication for Distributed Gradient Descent, 2017, EMNLP.
[32] Forrest N. Iandola, et al. FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[33] Yang You, et al. Scaling SGD Batch Size to 32K for ImageNet Training, 2017, ArXiv.
[34] Dumitru Erhan, et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Nam Sung Kim, et al. GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training, 2018, NeurIPS.
[36] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[37] Parijat Dube, et al. Slow and Stale Gradients Can Win the Race, 2018, IEEE Journal on Selected Areas in Information Theory.
[38] Kilian Q. Weinberger, et al. Densely Connected Convolutional Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Quoc V. Le, et al. Don't Decay the Learning Rate, Increase the Batch Size, 2017, ICLR.
[40] Kurt Keutzer, et al. Large batch size training of neural networks with adversarial training and second-order information, 2018, ArXiv.
[41] Dimitris S. Papailiopoulos, et al. Gradient Diversity Empowers Distributed Learning, 2017, ArXiv.
[42] Kyunghyun Cho, et al. The Break-Even Point on Optimization Trajectories of Deep Neural Networks, 2020, ICLR.
[43] Ethan Dyer, et al. Gradient Descent Happens in a Tiny Subspace, 2018, ArXiv.
[44] David J. Schwab, et al. The Early Phase of Neural Network Training, 2020, ICLR.
[45] Ji Liu, et al. Gradient Sparsification for Communication-Efficient Distributed Optimization, 2017, NeurIPS.
[46] Dimitris S. Papailiopoulos, et al. ATOMO: Communication-efficient Learning via Atomic Sparsification, 2018, NeurIPS.
[47] Christopher De Sa, et al. Distributed Learning with Sublinear Communication, 2019, ICML.
[48] William J. Dally, et al. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training, 2017, ICLR.
[49] Stefano Soatto, et al. Critical Learning Periods in Deep Networks, 2018, ICLR.
[50] Dong Yu, et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs, 2014, INTERSPEECH.