Decentralized Bayesian Learning over Graphs

We propose a decentralized learning algorithm over a general social network. The algorithm leaves the training data distributed on the mobile devices while utilizing a peer-to-peer model aggregation method. It allows agents with local data to learn a shared model explaining the global training data in a decentralized fashion. The proposed algorithm can be viewed as a Bayesian, peer-to-peer variant of federated learning in which each agent keeps a "posterior probability distribution" over the global model parameters. Each agent updates its "posterior" based on 1) its local training data and 2) asynchronous communication and model aggregation with its 1-hop neighbors. This Bayesian formulation allows for a systematic treatment of model aggregation over any connected graph. Furthermore, it provides strong analytic guarantees on convergence in the realizable case, as well as a closed-form characterization of the rate of convergence. We also show that our methodology can be combined with efficient Bayesian inference techniques to train Bayesian neural networks in a decentralized manner. Through empirical studies, we show that our theoretical analysis can guide the design of network/social interactions and data partitioning to achieve convergence.
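To make the two-step update concrete, the following is a minimal sketch of this style of peer-to-peer Bayesian aggregation, under illustrative assumptions that are not taken from the paper: a scalar parameter with Gaussian posteriors, a fixed four-agent ring with self-loops, synchronous rounds (the paper's scheme is asynchronous), and log-linear (precision-weighted) pooling of neighbor posteriors as the aggregation rule.

```python
# Minimal sketch (not the paper's exact algorithm): each agent keeps a Gaussian
# "posterior" over a scalar model parameter, updates it on its local data, and
# then aggregates with its 1-hop neighbors via log-linear (precision-weighted)
# pooling. Topology, mixing rule, and the Gaussian family are illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_theta = 2.0      # parameter generating every agent's data (realizable case)
noise_std = 1.0

# Undirected ring of 4 agents: adjacency with self-loops, row-normalized weights
A = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)

n_agents = A.shape[0]
mu = np.zeros(n_agents)    # posterior means, one per agent
prec = np.ones(n_agents)   # posterior precisions (1/variance), one per agent

for t in range(200):
    # 1) Local Bayesian update on one fresh Gaussian observation per agent
    for i in range(n_agents):
        x = true_theta + noise_std * rng.standard_normal()
        lik_prec = 1.0 / noise_std**2
        new_prec = prec[i] + lik_prec
        mu[i] = (prec[i] * mu[i] + lik_prec * x) / new_prec
        prec[i] = new_prec

    # 2) Aggregation with 1-hop neighbors: log-linear pooling of Gaussians is
    #    again Gaussian, obtained by mixing the natural parameters
    #    (precision, precision * mean) with the graph weights
    eta = prec * mu            # natural parameter: precision-weighted mean
    prec = W @ prec
    eta = W @ eta
    mu = eta / prec

print("posterior means:", np.round(mu, 3))  # all agents concentrate near true_theta
```

Run as-is, every agent's posterior mean converges to the shared true parameter, illustrating how local Bayesian updates combined with neighbor aggregation can drive agreement on a connected graph; the paper's actual update rule, posterior family, and convergence analysis are what the abstract refers to.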
