Fast Distributed Deep Learning via Worker-adaptive Batch Sizing

In heterogeneous or shared clusters, distributed learning is slowed down by straggling workers. In this work, we propose LB-BSP, a new synchronization scheme that eliminates stragglers by adapting each worker's training load (batch size) to its processing capability. For training in shared production clusters, a prerequisite for setting the workers' batch sizes is knowing their processing speeds before each iteration starts. To this end, we adopt NARX, an extended recurrent neural network whose predictions account for both the workers' historical speeds and driving factors such as CPU and memory usage.
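As a minimal sketch of the load-balancing idea (not the paper's actual implementation), the Python snippet below assigns each worker a batch size proportional to its predicted processing speed, so that all workers are expected to finish an iteration at roughly the same time. The function name, the example speeds, and the global batch size are assumptions for illustration; the predicted speeds would come from a NARX-style predictor as described above.

    # Hypothetical sketch: worker-adaptive batch sizing in the LB-BSP spirit.
    # predicted_speeds stands in for the output of a NARX-based speed predictor,
    # which in practice would be fed each worker's recent per-sample speeds and
    # CPU/memory signals.

    def assign_batch_sizes(predicted_speeds, total_batch_size):
        """Split a fixed global batch across workers in proportion to their
        predicted processing speeds (samples/second), so expected iteration
        times are roughly equal."""
        total_speed = sum(predicted_speeds)
        # Provisional proportional shares, rounded down.
        shares = [int(total_batch_size * s / total_speed) for s in predicted_speeds]
        # Hand out the few remaining samples to the fastest workers.
        remainder = total_batch_size - sum(shares)
        fastest_first = sorted(range(len(shares)),
                               key=lambda i: predicted_speeds[i], reverse=True)
        for i in fastest_first[:remainder]:
            shares[i] += 1
        return shares

    # Example: three workers, the second predicted to be twice as fast.
    speeds = [100.0, 200.0, 100.0]          # predicted samples/second per worker
    print(assign_batch_sizes(speeds, 512))  # -> [128, 256, 128]

Under this proportional assignment, the slower workers process fewer samples per iteration, which is the mechanism by which stragglers are eliminated at each synchronization barrier.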
