BlueConnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy
Minsik Cho | Ulrich Finkler | David S. Kung | Hillery C. Hunter | Mauricio J. Serrano