BOA: batch orchestration algorithm for straggler mitigation of distributed DL training in heterogeneous GPU cluster
[1] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.
[2] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.
[3] Nuwan S. Ferdinand, et al. Anytime Exploitation of Stragglers in Synchronous Stochastic Gradient Descent, 2017, 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA).
[4] Olatunji Ruwase, et al. Performance Modeling and Scalability Optimization of Distributed Deep Learning Systems, 2015, KDD.
[5] Samy Bengio, et al. Revisiting Distributed Synchronous SGD, 2016, ArXiv.
[6] John Tran, et al. cuDNN: Efficient Primitives for Deep Learning, 2014, ArXiv.
[7] Alexandros G. Dimakis, et al. Gradient Coding: Avoiding Stragglers in Distributed Learning, 2017, ICML.
[8] Tao Wang, et al. Deep learning with COTS HPC systems, 2013, ICML.
[9] Forrest N. Iandola, et al. FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[10] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Eric P. Xing, et al. Addressing the straggler problem for iterative convergent parallel ML, 2016, SoCC.
[12] François Laviolette, et al. Domain-Adversarial Training of Neural Networks, 2015, J. Mach. Learn. Res.
[13] H. Robbins. A Stochastic Approximation Method, 1951.
[14] Dirk Merkel, et al. Docker: lightweight Linux containers for consistent development and deployment, 2014.
[15] Jiawei Jiang, et al. Heterogeneity-aware Distributed Parameter Servers, 2017, SIGMOD Conference.
[16] Chan-Hyun Youn, et al. An Adaptive Batch-Orchestration Algorithm for the Heterogeneous GPU Cluster Environment in Distributed Deep Learning System, 2018, 2018 IEEE International Conference on Big Data and Smart Computing (BigComp).
[17] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.
[18] Stephen P. Boyd, et al. CVXPY: A Python-Embedded Modeling Language for Convex Optimization, 2016, J. Mach. Learn. Res.
[19] Christopher Ré, et al. DimmWitted: A Study of Main-Memory Statistical Analytics, 2014, Proc. VLDB Endow.
[20] Wencong Xiao, et al. Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications, 2018.
[21] Guigang Zhang, et al. Deep Learning, 2016, Int. J. Semantic Comput.
[22] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[23] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.
[24] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.
[25] Alexander J. Smola, et al. Parallelized Stochastic Gradient Descent, 2010, NIPS.