A Framework for Parallel and Distributed Training of Neural Networks

The aim of this paper is to develop a general framework for training neural networks (NNs) in a distributed environment, where training data is partitioned over a set of agents that communicate with each other through a sparse, possibly time-varying, connectivity pattern. In such a distributed scenario, the training problem can be formulated as the (regularized) optimization of a non-convex social cost function, given by the sum of local (non-convex) costs, where each agent contributes a single error term defined with respect to its local dataset. To devise a flexible and efficient solution, we customize a recently proposed framework for non-convex optimization over networks, which hinges on a (primal) convexification-decomposition technique to handle non-convexity, and on a dynamic consensus procedure to diffuse information among the agents. Several typical choices for the training criterion (e.g., squared loss, cross-entropy) and for the regularization (e.g., the ℓ2 norm, sparsity-inducing penalties) are included in the framework and explored throughout the paper. Convergence to a stationary solution of the social non-convex problem is guaranteed under mild assumptions. Additionally, we show a principled way for each agent to exploit a possible multi-core architecture (e.g., a local cloud) in order to parallelize its local optimization step, resulting in strategies that are both distributed (across the agents) and parallel (inside each agent) in nature. A comprehensive set of experimental results validates the proposed approach.
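To make the formulation concrete, write $w$ for the NN weights, $I$ for the number of agents, $f_i$ for the local (non-convex) training loss of agent $i$ over its dataset $\mathcal{D}_i$, and $r$ for a regularizer weighted by $\lambda \geq 0$ (this notation is chosen here for illustration and need not match the paper's symbols). The social training problem sketched above then reads

$$\min_{w} \; U(w) \;=\; \sum_{i=1}^{I} f_i(w; \mathcal{D}_i) \;+\; \lambda\, r(w),$$

where, for instance, $f_i$ accumulates a squared or cross-entropy loss over agent $i$'s local examples, and $r$ is the ℓ2 norm or a sparsity-inducing penalty.

As a rough illustration of how such a scheme can operate, the sketch below follows a NEXT-style iteration: each agent minimizes a strongly convex surrogate of the social cost, takes a damped step, averages estimates with its neighbors, and runs a dynamic-consensus (gradient-tracking) recursion on the average gradient. This is a minimal numerical sketch, not the paper's actual algorithm: the fully linearized surrogate, the ℓ1 regularizer, the tanh regression losses, the fixed ring topology, and all names and parameters (`grad_f`, `tau`, `lam`, `alpha`, etc.) are assumptions made here for demonstration.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1 (closed form for the l1 penalty)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

rng = np.random.default_rng(0)
I, d, n = 10, 5, 40          # agents, weight dimension, samples per agent

# Synthetic local datasets: a simple non-convex regression through tanh.
X = [rng.standard_normal((n, d)) for _ in range(I)]
w_true = rng.standard_normal(d)
T = [np.tanh(Xi @ w_true) + 0.05 * rng.standard_normal(n) for Xi in X]

def grad_f(i, w):
    """Gradient of the local squared loss ||T_i - tanh(X_i w)||^2 / n."""
    z = np.tanh(X[i] @ w)
    return -2.0 * X[i].T @ ((T[i] - z) * (1.0 - z ** 2)) / n

# Doubly stochastic mixing matrix on a ring (each agent talks to 2 neighbors).
A = np.zeros((I, I))
for i in range(I):
    A[i, i], A[i, (i - 1) % I], A[i, (i + 1) % I] = 0.5, 0.25, 0.25

tau, lam, alpha = 2.0, 0.01, 0.5   # surrogate curvature, l1 weight, step size
W = np.zeros((I, d))                                 # local estimates w_i
G = np.stack([grad_f(i, W[i]) for i in range(I)])    # local gradients
Y = G.copy()                                         # average-gradient trackers

for _ in range(500):
    # 1) Convexified local step: minimize a strongly convex surrogate of the
    #    social cost; with a fully linearized loss plus a proximal term this
    #    reduces to a soft-thresholding update.
    W_hat = np.stack([soft_threshold(W[i] - (I * Y[i]) / tau, lam / tau)
                      for i in range(I)])
    Z = W + alpha * (W_hat - W)          # 2) damped move toward the surrogate minimizer
    W = A @ Z                            # 3) consensus on the estimates
    G_new = np.stack([grad_f(i, W[i]) for i in range(I)])
    Y = A @ Y + (G_new - G)              # 4) dynamic consensus: track the average gradient
    G = G_new
```

Note that the closed-form soft-thresholding step is an artifact of linearizing the whole local loss; a richer surrogate that keeps part of the local cost intact would replace it with an inner convex solve, and that inner solve is precisely where each agent could parallelize the work across its local cores, as the abstract describes.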
