Empirical Investigation of Stale Value Tolerance on Parallel RNN Training

The objective of this paper is to provide a detailed understanding of the stale value tolerance of parallel training. During parallel training, multiple workers repeatedly read and modify shared model parameters, incurring many data transactions between workers, most of which are redundant because training tolerates stale values. While considerable effort has gone into reducing this excessive communication by exploiting stale value tolerance, there is no detailed understanding of stale value tolerance or of how it depends on the many design choices made when training neural networks. This ambiguity has prevented domain experts from designing systems that realize the full performance potential of stale value tolerance. This paper investigates how communication reduction affects the progress of parallel training for recurrent neural networks (RNNs). We study the stale value tolerance of RNN training by varying the update density, the activation function, and the learning rate.
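
To make the staleness mechanism concrete, below is a minimal, hypothetical sketch (not the paper's code or experimental setup) of stale-value-tolerant parallel SGD: each worker trains against a local, possibly stale replica of the shared parameters and synchronizes only with probability `update_density`, modeling the communication reduction being studied. The toy least-squares objective, the averaging merge rule, and all names (`update_density`, `shared_w`, etc.) are illustrative assumptions, not the paper's API.

```python
# Hypothetical sketch of stale-value-tolerant parallel SGD (illustrative only).
import threading
import numpy as np

# Synthetic regression data shared by all workers.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(1024, 16))
w_true = data_rng.normal(size=16)
y = X @ w_true + 0.01 * data_rng.normal(size=1024)

shared_w = np.zeros(16)        # shared model parameters
merge_lock = threading.Lock()  # guards only the infrequent synchronizations

def worker(seed, update_density, lr, steps=2000, batch=32):
    rng = np.random.default_rng(seed)
    local_w = shared_w.copy()  # stale local replica of the shared parameters
    for _ in range(steps):
        idx = rng.integers(0, X.shape[0], size=batch)
        grad = X[idx].T @ (X[idx] @ local_w - y[idx]) / batch
        local_w -= lr * grad
        # Communicate only a fraction of the time; in between, keep computing
        # on stale values (the tolerance being exploited for communication reduction).
        if rng.random() < update_density:
            with merge_lock:
                shared_w[:] = 0.5 * (shared_w + local_w)  # simple averaging merge
                local_w = shared_w.copy()

threads = [threading.Thread(target=worker, args=(s, 0.1, 0.05)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("final MSE:", float(np.mean((X @ shared_w - y) ** 2)))
```

Lowering `update_density` in this sketch reduces communication but lets replicas drift further before merging, which is the trade-off the paper's experiments probe for RNNs under different activation functions and learning rates.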
