Communication-Efficient and Decentralized Multi-Task Boosting while Learning the Collaboration Graph

We study the decentralized machine learning setting in which many users collaborate to learn personalized models based on (i) their local datasets and (ii) a similarity graph over the users' learning tasks. Our approach trains nonlinear classifiers via multi-task boosting, without exchanging personal data and at low communication cost. When background knowledge about task similarities is unavailable, we propose to jointly learn the personalized models and a sparse collaboration graph through an alternating optimization procedure. We analyze the convergence rate, memory consumption, and communication complexity of our decentralized algorithms, and demonstrate the benefits of our approach over competing techniques on synthetic and real datasets.
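To make the alternating procedure concrete, below is a minimal single-machine sketch of the two steps the abstract alternates between: one boosting round per user that leverages neighbors' models through the current graph, followed by re-estimation of a sparse collaboration graph from the updated models. This is an illustration under stated assumptions, not the paper's algorithm: the helper names (`fit_weak_learner`, `learn_sparse_graph`), the neighbor-smoothed residuals, and the probe-set graph update are hypothetical simplifications, and everything runs in one process rather than over a decentralized network.

```python
"""Hypothetical sketch of alternating between (1) multi-task boosting rounds
and (2) learning a sparse collaboration graph. Not the paper's algorithm."""
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def fit_weak_learner(X, residuals):
    # Depth-1 regression tree (a stump) fit to the current pseudo-residuals,
    # as in standard L2 gradient boosting.
    stump = DecisionTreeRegressor(max_depth=1)
    stump.fit(X, residuals)
    return stump

def predict(ensemble, X, lr=0.5):
    # Additive model: sum of shrunken weak-learner outputs.
    out = np.zeros(len(X))
    for h in ensemble:
        out += lr * h.predict(X)
    return out

def learn_sparse_graph(preds_on_probe, k=3):
    # Re-estimate the collaboration graph from model agreement on a shared
    # probe set, keeping only each user's k most similar peers (sparsity).
    n = preds_on_probe.shape[0]
    d = ((preds_on_probe[:, None, :] - preds_on_probe[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d / (d.mean() + 1e-12))
    np.fill_diagonal(W, 0.0)
    for i in range(n):                       # top-k sparsification per row
        W[i, np.argsort(W[i])[:-k]] = 0.0
    W = np.maximum(W, W.T)                   # symmetrize
    return W / (W.sum(axis=1, keepdims=True) + 1e-12)

# Toy setup: two clusters of users with opposite labeling rules.
n_users, n_pts = 6, 40
tasks = [rng.standard_normal((n_pts, 2)) for _ in range(n_users)]
labels = [np.sign(X[:, 0]) * (1 if i < 3 else -1) for i, X in enumerate(tasks)]
probe = rng.standard_normal((50, 2))         # shared unlabeled probe points

ensembles = [[] for _ in range(n_users)]
W = np.full((n_users, n_users), 1.0 / (n_users - 1))
np.fill_diagonal(W, 0.0)

for it in range(20):
    # (1) Model step: one boosting round per user against a neighbor-smoothed
    #     prediction, so similar tasks pull each other's models together.
    for i in range(n_users):
        own = predict(ensembles[i], tasks[i])
        nbr = sum(W[i, j] * predict(ensembles[j], tasks[i])
                  for j in range(n_users))
        residuals = labels[i] - (0.7 * own + 0.3 * nbr)
        ensembles[i].append(fit_weak_learner(tasks[i], residuals))
    # (2) Graph step: refresh the sparse collaboration graph from the
    #     current personalized models.
    P = np.stack([predict(e, probe) for e in ensembles])
    W = learn_sparse_graph(P)

acc = np.mean([np.mean(np.sign(predict(ensembles[i], tasks[i])) == labels[i])
               for i in range(n_users)])
print(f"mean training accuracy: {acc:.2f}")
```

The design point mirrored here is the alternation itself: the model step treats the graph as fixed, and the graph step treats the models as fixed, so each subproblem stays simple while the two estimates refine each other across iterations.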
