Performance limits of stochastic sub-gradient learning, Part II: Multi-agent case

Abstract: The analysis in Part I [19] revealed interesting properties of sub-gradient learning algorithms in the context of stochastic optimization. These algorithms are used when the risk functions are non-smooth or involve non-differentiable components, and they have long been regarded as slow-converging methods. Part I [19] revealed, however, that the rate of convergence becomes linear for stochastic optimization problems, with the error iterate converging at the exponential rate α^i to within an O(μ)-neighborhood of the optimizer, for some α ∈ (0, 1) and a small step-size μ. This conclusion was established under weaker assumptions than those in the prior literature; moreover, several important problems were shown to satisfy these weaker assumptions automatically. These results revealed that sub-gradient learning methods have more favorable behavior than originally thought. The results of Part I [19] were, however, exclusive to single-agent adaptation. The purpose of the current Part II is to examine the implications of these discoveries when a collection of networked agents employs sub-gradient learning as its cooperative mechanism. The analysis shows that, despite the coupled dynamics that arise in a networked scenario, the agents are still able to attain linear convergence in the stochastic case; they are also able to reach agreement within O(μ) of the optimizer.
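To make the cooperative mechanism concrete, here is a minimal simulation sketch of a diffusion (adapt-then-combine) stochastic sub-gradient strategy applied to a toy ℓ1-regularized least-squares problem. The network size, ring topology, combination weights, data model, and step-size below are illustrative assumptions, not the paper's actual algorithm parameters or simulation setup.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 4, 5                          # agents, parameter dimension (assumed)
mu = 0.01                            # small constant step-size
rho = 0.1                            # l1-regularization weight
w_star = rng.standard_normal(M)      # common model observed noisily by all agents

# Doubly-stochastic combination matrix for a ring of N agents (assumed topology).
A = np.zeros((N, N))
for k in range(N):
    A[k, k] = 0.5
    A[k, (k - 1) % N] = 0.25
    A[k, (k + 1) % N] = 0.25

W = np.zeros((N, M))                 # one iterate per agent

for i in range(20000):
    # Adaptation: each agent takes a stochastic sub-gradient step on the
    # instantaneous risk 0.5*(d - h'w)^2 + rho*||w||_1 using streaming data.
    for k in range(N):
        h = rng.standard_normal(M)                    # regressor
        d = h @ w_star + 0.1 * rng.standard_normal()  # noisy measurement
        subgrad = (h @ W[k] - d) * h + rho * np.sign(W[k])
        W[k] = W[k] - mu * subgrad
    # Combination: convex averaging of each agent's intermediate iterate
    # with its neighbors' iterates.
    W = A @ W

# With small mu, the agents agree to within O(mu) of one another and
# hover in an O(mu)-neighborhood of the regularized minimizer.
print("max disagreement:", np.max(np.linalg.norm(W - W.mean(axis=0), axis=1)))
print("mean iterate:", W.mean(axis=0))
```

The constant step-size is what produces the behavior described in the abstract: a geometric transient decay at rate α^i together with a residual O(μ) error floor. A decaying step-size would instead remove the error floor at the cost of a slower, sub-linear rate.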

[1] M. W. Schmidt et al., "A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method," arXiv, 2012.

[2] C.-J. Lin et al., "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., 2011.

[3] A. Nedic et al., "Distributed Stochastic Subgradient Projection Algorithms for Convex Optimization," J. Optim. Theory Appl., 2008.

[4] S. P. Boyd et al., "Proximal Algorithms," Found. Trends Optim., 2013.

[5] J. D. Hamilton, Time Series Analysis, Princeton University Press, 1994.

[6] C. D. Meyer, Matrix Analysis and Applied Linear Algebra, SIAM, 2000.

[7] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, 1995.

[8] N. Z. Shor, Minimization Methods for Non-Differentiable Functions, Springer Series in Computational Mathematics, Springer, 1985.

[9] A. H. Sayed et al., "On the Learning Behavior of Adaptive Networks—Part I: Transient Analysis," IEEE Trans. Inf. Theory, 2013.

[10] A. H. Sayed et al., "Diffusion LMS Strategies for Distributed Estimation," IEEE Trans. Signal Process., 2010.

[11] Y. Singer et al., "Pegasos: primal estimated sub-gradient solver for SVM," Math. Program., 2011.

[12] M. Teboulle et al., "Fast Gradient-Based Algorithms for Constrained Total Variation Image Denoising and Deblurring Problems," IEEE Trans. Image Process., 2009.

[13] A. H. Sayed et al., "Adaptive Networks," Proc. IEEE, 2014.

[14] A. H. Sayed, Adaptation, Learning, and Optimization over Networks, Found. Trends Mach. Learn., 2014.

[15] A. H. Sayed et al., "Performance limits of single-agent and multi-agent sub-gradient stochastic learning," Proc. IEEE ICASSP, 2016.

[16] M. Elad et al., "Stable recovery of sparse overcomplete representations in the presence of noise," IEEE Trans. Inf. Theory, 2006.

[17] W. Yu et al., "Distributed Consensus Filtering in Sensor Networks," IEEE Trans. Syst., Man, Cybern. B, 2009.

[18] R. J.-B. Wets, "Stochastic programming," in Handbooks in Operations Research and Management Science, North-Holland, 1989.

[19] A. H. Sayed et al., "Performance limits of stochastic sub-gradient learning, Part I: Single agent case," Signal Process., 2015.

[20] E. Moulines et al., "Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning," Proc. NIPS, 2011.

[21] A. H. Sayed et al., "Stability and Performance Limits of Adaptive Primal-Dual Networks," IEEE Trans. Signal Process., 2014.

[22] M. Teboulle et al., "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems," SIAM J. Imaging Sci., 2009.

[23] I. Yamada et al., "A sparse adaptive filtering using time-varying soft-thresholding techniques," Proc. IEEE ICASSP, 2010.

[24] A. H. Sayed et al., "Sparse diffusion LMS for distributed adaptive estimation," Proc. IEEE ICASSP, 2012.

[25] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, 1997.

[26] R. Wets, "Stochastic programming," 1989.

[27] J. Mairal et al., "Optimization with Sparsity-Inducing Penalties," Found. Trends Mach. Learn., 2011.

[28] A. E. Ozdaglar et al., "Distributed Subgradient Methods for Multi-Agent Optimization," IEEE Trans. Autom. Control, 2009.

[29] A. O. Hero et al., "Sparse LMS for system identification," Proc. IEEE ICASSP, 2009.

[30] T. Hastie et al., The Elements of Statistical Learning, Springer, 2001.

[31] S. Kar et al., "Distributed Consensus Algorithms in Sensor Networks With Imperfect Communication: Link Failures and Channel Noise," IEEE Trans. Signal Process., 2007.

[32] I. Johnstone et al., "Ideal spatial adaptation by wavelet shrinkage," Biometrika, 1994.

[33] A. Shapiro et al., "Stochastic Approximation approach to Stochastic Programming," 2013.

[34] J. E. Kelley, "The Cutting-Plane Method for Solving Convex Programs," J. Soc. Ind. Appl. Math., 1960.

[35] A. H. Sayed et al., "Diffusion Strategies Outperform Consensus Strategies for Distributed Estimation Over Adaptive Networks," IEEE Trans. Signal Process., 2012.

[36] S. Theodoridis et al., "Online Sparse System Identification and Signal Reconstruction Using Projections Onto Weighted ℓ1 Balls," IEEE Trans. Signal Process., 2010.

[37] V. Vapnik, Statistical Learning Theory, Wiley, 1998.

[38] A. H. Sayed et al., "On the Learning Behavior of Adaptive Networks—Part II: Performance Analysis," IEEE Trans. Inf. Theory, 2013.

[39] D. L. Duttweiler, "Proportionate normalized least-mean-squares adaptation in echo cancelers," IEEE Trans. Speech Audio Process., 2000.

[40] S. Haykin, Adaptive Filters, 2007.

[41] R. Tibshirani, "Regression Shrinkage and Selection via the Lasso," J. R. Stat. Soc. Ser. B, 1996.

[42] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

[43] S. Theodoridis et al., "A Sparsity Promoting Adaptive Algorithm for Distributed Learning," IEEE Trans. Signal Process., 2012.

[44] A. H. Sayed et al., "Excess-Risk of Distributed Stochastic Learners," IEEE Trans. Inf. Theory, 2013.

[45] E. K. P. Chong and S. H. Zak, An Introduction to Optimization, Wiley, 1996.

[46] L. Bottou, "Large-Scale Machine Learning with Stochastic Gradient Descent," Proc. COMPSTAT, 2010.

[47] B. Ripley et al., "Pattern Recognition," Nature, 1968.

[48] M. J. Wainwright et al., "Information-Theoretic Lower Bounds on the Oracle Complexity of Stochastic Convex Optimization," IEEE Trans. Inf. Theory, 2010.

[49] M. A. Saunders et al., "Atomic Decomposition by Basis Pursuit," SIAM J. Sci. Comput., 1998.

[50] L. Bottou et al., "The Tradeoffs of Large Scale Learning," Proc. NIPS, 2007.

[51] L. Rudin et al., "Nonlinear total variation based noise removal algorithms," Physica D, 1992.

[52] A. H. Sayed et al., "Sparse Distributed Learning Based on Diffusion Adaptation," IEEE Trans. Signal Process., 2012.

[53] Z. Zhang et al., "Diffusion Sparse Least-Mean Squares Over Networks," IEEE Trans. Signal Process., 2012.

[54] J. C. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999.

[55] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.