On Consensus-Optimality Trade-offs in Collaborative Deep Learning

In distributed machine learning, where agents collaboratively learn from diverse private datasets, there is a fundamental tension between consensus and optimality. In this paper, we build on recent algorithmic progress in distributed deep learning to explore various consensus-optimality trade-offs over a fixed communication topology. First, we propose the incremental consensus-based distributed SGD (i-CDSGD) algorithm, which performs multiple consensus steps (where each agent exchanges information with its neighbors) within each SGD iteration. Second, we propose the generalized consensus-based distributed SGD (g-CDSGD) algorithm, which enables us to navigate the full spectrum from complete consensus (all agents agree) to complete disagreement (each agent converges to its own model parameters). We analytically establish convergence of the proposed algorithms for strongly convex and nonconvex objective functions; we also analyze the momentum variants of the algorithms for the strongly convex case. We support our algorithms with numerical experiments and demonstrate significant improvements over existing methods for collaborative deep learning.
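
The per-iteration structure of the two algorithms can be illustrated with a short sketch. Below is a minimal NumPy sketch, assuming a doubly stochastic mixing matrix `pi` defined over the fixed communication topology; the names `tau` (number of consensus steps per iteration), `omega` (consensus-disagreement blending weight), and both step functions are illustrative assumptions rather than the paper's exact formulation.

```python
# Hypothetical sketch of the per-iteration updates described in the abstract.
# All names and signatures here are illustrative assumptions.
import numpy as np

def i_cdsgd_step(params, grads, pi, lr, tau):
    """One i-CDSGD iteration: tau consensus (mixing) steps, then a local SGD step.

    params: (n_agents, dim) array of per-agent model parameters
    grads:  (n_agents, dim) array of per-agent stochastic gradients
    pi:     (n_agents, n_agents) doubly stochastic mixing matrix over the topology
    """
    mixed = params.copy()
    for _ in range(tau):          # repeated neighbor averaging within one iteration
        mixed = pi @ mixed
    return mixed - lr * grads     # local stochastic gradient step

def g_cdsgd_step(params, grads, pi, lr, omega):
    """One g-CDSGD iteration: blend a consensus step with a purely local step.

    omega in [0, 1]: omega -> 1 emphasizes consensus-style mixing,
    omega -> 0 leaves each agent with its own (non-collaborative) SGD update.
    """
    mixed = omega * (pi @ params) + (1.0 - omega) * params
    return mixed - lr * grads
```

In this sketch, increasing `tau` trades extra communication for tighter agreement among agents, while sweeping `omega` from 0 to 1 traces the spectrum between complete disagreement and complete consensus described above.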
