Efficient Decentralized Deep Learning by Dynamic Model Averaging

We propose an efficient protocol for decentralized training of deep neural networks from distributed data sources. The protocol handles the different phases of model training equally well and adapts quickly to concept drift. This reduces communication by an order of magnitude compared to periodically communicating state-of-the-art approaches. Moreover, we derive a communication bound that scales well with the hardness of the serialized learning problem. The reduction in communication comes at almost no cost: the predictive performance remains virtually unchanged, and the proposed protocol retains the loss bounds of periodic averaging schemes. An extensive empirical evaluation shows a substantial improvement in the trade-off between model performance and communication, which could benefit numerous decentralized learning applications such as autonomous driving, or voice recognition and image classification on mobile phones.
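To make the idea concrete, below is a minimal sketch of a dynamic averaging loop in Python. The abstract does not spell out the synchronization criterion, so this sketch assumes one plausible variant: each worker runs local SGD on a simple model (a linear stand-in for a deep network), and a full averaging round is triggered only when the workers' average squared divergence from the last synchronized reference model exceeds a threshold delta. All names here (local_sgd_step, dynamic_averaging, delta) are illustrative, not taken from the paper.

import numpy as np

def local_sgd_step(w, x, y, lr=0.01):
    # One SGD step on a linear model with squared loss (a stand-in for a deep net).
    grad = 2.0 * (w @ x - y) * x
    return w - lr * grad

def dynamic_averaging(workers, reference, delta):
    # Trigger a full averaging round only if the average squared divergence
    # from the last synchronized reference model exceeds the threshold delta.
    divergence = np.mean([np.sum((w - reference) ** 2) for w in workers])
    if divergence > delta:
        avg = np.mean(workers, axis=0)
        return [avg.copy() for _ in workers], avg, True   # communication happened
    return workers, reference, False                      # silent round, no traffic

rng = np.random.default_rng(0)
d, m, delta, rounds = 10, 4, 0.5, 200
true_w = rng.normal(size=d)
workers = [rng.normal(scale=0.01, size=d) for _ in range(m)]
reference = np.mean(workers, axis=0)

sync_rounds = 0
for t in range(rounds):
    for i in range(m):                       # each worker sees its own local sample
        x = rng.normal(size=d)
        y = true_w @ x + rng.normal(scale=0.1)
        workers[i] = local_sgd_step(workers[i], x, y)
    workers, reference, synced = dynamic_averaging(workers, reference, delta)
    sync_rounds += synced

print(f"averaging rounds used: {sync_rounds} of {rounds} possible")

Setting delta to zero reduces the sketch to periodic averaging every round; raising delta trades communication for bounded local divergence, which is the trade-off between model performance and communication referred to above.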
