Elastic parameter server load distribution in deep learning clusters
Yibo Zhu | Chuanxiong Guo | Chuan Wu | Yanghua Peng | Yangrui Chen | Yixin Bao
[1] Zheng Zhang, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015, ArXiv.
[2] Yibo Zhu, et al. A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters, 2020, OSDI.
[3] Dumitru Erhan, et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Alexandros G. Dimakis, et al. Gradient Coding: Avoiding Stragglers in Distributed Learning, 2017, ICML.
[5] Alexander Sergeev, et al. Horovod: fast and easy distributed deep learning in TensorFlow, 2018, ArXiv.
[6] Gregory R. Ganger, et al. Proteus: agile ML elasticity through tiered reliability in dynamic resource markets, 2017, EuroSys.
[7] Alexander J. Smola, et al. Scaling Distributed Machine Learning with the Parameter Server, 2014, OSDI.
[8] Richard G. Baraniuk, et al. Connection-level analysis and modeling of network traffic, 2001, IMW '01.
[9] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[10] Bo Li, et al. Round-Robin Synchronization: Mitigating Communication Bottlenecks in Parameter Servers, 2019, IEEE INFOCOM 2019 - IEEE Conference on Computer Communications.
[11] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.
[12] Jiawei Jiang, et al. Heterogeneity-aware Distributed Parameter Servers, 2017, SIGMOD Conference.
[13] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Eric P. Xing, et al. Addressing the straggler problem for iterative convergent parallel ML, 2016, SoCC.
[15] Luca Antiga, et al. Automatic differentiation in PyTorch, 2017.
[16] Haichen Shen, et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, 2018, OSDI.
[17] Michael J. Freedman, et al. Resource Elasticity in Distributed Deep Learning, 2020, MLSys.
[18] Shengen Yan, et al. Towards Distributed Machine Learning in Shared Clusters: A Dynamically-Partitioned Approach, 2017, 2017 IEEE International Conference on Smart Computing (SMARTCOMP).
[19] Geoffrey E. Hinton, et al. Speech recognition with deep recurrent neural networks, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[20] G. Newman, et al. Confidence Intervals, 1987, The Lancet.
[21] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.
[22] Chuan Wu, et al. Optimus: an efficient dynamic resource scheduler for deep learning clusters, 2018, EuroSys.
[23] Wei Zhang, et al. Asynchronous Decentralized Parallel Stochastic Gradient Descent, 2017, ICML.
[24] Wei Lin, et al. DL2: A Deep Learning-Driven Scheduler for Deep Learning Clusters, 2019, IEEE Transactions on Parallel and Distributed Systems.
[25] Xin Yuan, et al. Bandwidth Efficient All-reduce Operation on Tree Topologies, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[26] Shuai Wang, et al. Accelerating Distributed Machine Learning by Smart Parameter Server, 2019, APNet.
[27] H. Robbins. A Stochastic Approximation Method, 1951.
[28] Yibo Zhu, et al. A generic communication scheduler for distributed DNN training acceleration, 2019, SOSP.
[29] Zongpeng Li, et al. Online Job Scheduling in Distributed Machine Learning Clusters, 2018, IEEE INFOCOM 2018 - IEEE Conference on Computer Communications.
[30] Chuan Wu, et al. Deep Learning-based Job Placement in Distributed Machine Learning Clusters, 2019, IEEE INFOCOM 2019 - IEEE Conference on Computer Communications.
[31] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.
[32] Scott Shenker, et al. Effective Straggler Mitigation: Attack of the Clones, 2013, NSDI.
[33] Xin Yuan, et al. Bandwidth optimal all-reduce algorithms for clusters of workstations, 2009, J. Parallel Distributed Comput.
[34] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.
[35] Abutalib Aghayev, et al. Litz: Elastic Framework for High-Performance Distributed Machine Learning, 2018, USENIX Annual Technical Conference.
[36] Aijun An, et al. Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning, 2019, 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS).
[37] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[38] Xuehai Qian, et al. Hop: Heterogeneity-aware Decentralized Training, 2019, ASPLOS.
[39] Chuan Wu, et al. Preemptive All-reduce Scheduling for Expediting Distributed DNN Training, 2020, IEEE INFOCOM 2020 - IEEE Conference on Computer Communications.
[40] Arthur Charguéraud, et al. Scheduling parallel programs by work stealing with private deques, 2013, PPoPP '13.
[41] Yaoliang Yu, et al. Petuum: A New Platform for Distributed Machine Learning on Big Data, 2013, IEEE Transactions on Big Data.