Model Aggregation via Good-Enough Model Spaces

In many applications, the training data for a machine learning task is partitioned across multiple nodes, and aggregating this data may be infeasible due to communication, privacy, or storage constraints. Existing distributed optimization methods for learning global models in these settings typically aggregate local updates from each node in an iterative fashion. However, these approaches require many rounds of communication between nodes and assume that updates can be shared synchronously across a connected network. In this work, we present Good-Enough Model Spaces (GEMS), a novel framework for learning a global model by carefully intersecting the sets of "good-enough" models computed at each node. Our approach requires minimal communication and does not require data to be shared between nodes. We present methods for learning both convex models and neural networks within this framework and discuss how small samples of held-out data can be used for post-learning fine-tuning. In experiments on image and medical datasets, our approach improves accuracy over baseline aggregation techniques such as ensembling and model averaging by up to 15 points.
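To make the intersection idea concrete, below is a minimal Python sketch of GEMS-style aggregation for a convex model, under the assumption that each node's good-enough set is approximated by an L2 ball around its local optimum and that the server finds a point in the intersection of those balls by alternating projections. The function names (`local_good_enough_ball`, `intersect_balls`), the ridge-regression local objective, and the curvature-based radius bound are all illustrative choices for this sketch, not the paper's exact algorithm.

```python
# Hypothetical sketch of GEMS-style aggregation for convex models.
# Assumptions (not taken from the paper): each node's "good-enough" set is an
# L2 ball (center w_k, radius r_k) around its local optimum, and the server
# intersects the balls via alternating projections (POCS).
import numpy as np

def local_good_enough_ball(X, y, epsilon, lam=1e-2):
    """Fit a local ridge-regression model and estimate a radius such that any
    parameter vector within that radius keeps the local loss within `epsilon`
    of its minimum (crude curvature-based bound for a quadratic loss)."""
    n, d = X.shape
    H = X.T @ X / n + lam * np.eye(d)              # local Hessian of the ridge loss
    w_star = np.linalg.solve(H, X.T @ y / n)       # local optimum
    # For a quadratic loss, the increase within a ball of radius r is at most
    # 0.5 * lambda_max(H) * r^2, so r = sqrt(2 * epsilon / lambda_max).
    r = np.sqrt(2.0 * epsilon / np.linalg.eigvalsh(H)[-1])
    return w_star, r

def intersect_balls(centers, radii, iters=500):
    """Find a point in the intersection of L2 balls via alternating projections;
    if the balls do not intersect, this converges to a compromise point."""
    w = np.mean(centers, axis=0)                   # start from the model average
    for _ in range(iters):
        for c, r in zip(centers, radii):
            gap = w - c
            dist = np.linalg.norm(gap)
            if dist > r:                           # project onto the k-th ball
                w = c + gap * (r / dist)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = rng.normal(size=10)
    balls = []
    for _ in range(4):                             # four simulated nodes
        X = rng.normal(size=(200, 10))
        y = X @ w_true + 0.1 * rng.normal(size=200)
        balls.append(local_good_enough_ball(X, y, epsilon=0.05))
    centers, radii = zip(*balls)
    w_global = intersect_balls(np.array(centers), np.array(radii))
    print("distance to true model:", np.linalg.norm(w_global - w_true))
```

Each node communicates only a center and a radius (one vector and one scalar), which is what keeps communication minimal in this sketch; a held-out sample could then be used to fine-tune the returned point, as the abstract describes.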
