A Quantitative Analysis on Required Network Bandwidth for Large-Scale Parallel Machine Learning

Parallelization is essential for machine learning systems that deal with large-scale datasets. Data-parallel machine learning systems, which are composed of multiple machine learning modules, exchange parameters over the network to synchronize the models held by those modules. We investigate the network bandwidth requirements of various parameter exchange methods using a cluster simulator called SimGrid. We confirm that (1) direct exchange methods are substantially more efficient than parameter-server-based methods, and (2) with proper exchange methods, the bisection bandwidth of the network does not affect efficiency, which implies that a smaller investment in network facilities is sufficient.
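As a rough illustration of why direct exchange scales better than a centralized parameter server, the sketch below gives a back-of-envelope traffic model in Python. This is not the SimGrid experiment reported in the paper, and the worker count, parameter count, and 4-byte parameter size are hypothetical; it only shows that the server's aggregate traffic grows with the number of workers, while a ring-allreduce-style direct exchange keeps per-worker traffic nearly constant.

```python
# Back-of-envelope comparison of per-iteration communication volume for two
# parameter-exchange schemes. Illustrative only; the paper's results come
# from SimGrid simulations, not from this closed-form model.

def parameter_server_traffic(num_workers: int, num_params: int,
                              bytes_per_param: int = 4):
    """Each worker pushes gradients and pulls parameters once per iteration;
    the server handles the sum of all workers' traffic (the bottleneck)."""
    per_worker = 2 * num_params * bytes_per_param               # push + pull
    at_server = 2 * num_workers * num_params * bytes_per_param  # aggregate at server
    return per_worker, at_server


def ring_allreduce_traffic(num_workers: int, num_params: int,
                            bytes_per_param: int = 4):
    """Direct exchange via ring allreduce: each worker sends and receives
    about 2*(N-1)/N * P elements, nearly independent of the worker count."""
    per_worker = 2 * (num_workers - 1) / num_workers * num_params * bytes_per_param
    return per_worker


if __name__ == "__main__":
    n, p = 64, 60_000_000  # hypothetical: 64 workers, ~60M-parameter model
    ps_worker, ps_server = parameter_server_traffic(n, p)
    ar_worker = ring_allreduce_traffic(n, p)
    print(f"parameter server: {ps_worker / 1e9:.2f} GB per worker, "
          f"{ps_server / 1e9:.2f} GB at the server per iteration")
    print(f"ring allreduce  : {ar_worker / 1e9:.2f} GB per worker per iteration")
```

Under this simple model, the per-worker traffic of the two schemes is comparable, but the parameter server must absorb roughly N times that volume, which is why direct exchange methods make better use of a fixed network budget as the cluster grows.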
