Proximal-Proximal-Gradient Method

In this paper, we present the proximal-proximal-gradient method (PPG), a novel optimization method that is simple to implement and simple to parallelize. PPG generalizes the proximal-gradient method and ADMM and applies to minimization problems written as a sum of many differentiable and many non-differentiable convex functions; the non-differentiable functions may be coupled. We furthermore present a related stochastic variant, which we call stochastic PPG (S-PPG). S-PPG can be interpreted as a generalization of Finito and MISO to sums of many coupled non-differentiable convex functions. We present many applications that can benefit from PPG and S-PPG and prove convergence for both methods. A key strength of PPG and S-PPG, compared to existing methods, is their ability to directly handle a large sum of non-differentiable, non-separable functions with a constant stepsize that is independent of the number of functions. Such non-diminishing stepsizes allow the methods to be fast.
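To make the setup concrete, below is a minimal Python sketch of the iteration, assuming the three-operator-splitting form of the PPG update for problems of the form minimize r(x) + (1/n) * sum_i [f_i(x) + g_i(x)], with each f_i differentiable and r, g_i proximable: maintain one variable z_i per summand, form a half-step x by a prox of r at the average of the z_i, take a proximal-gradient-type step on each pair (f_i, g_i), then update z_i. The function name ppg and the callback names prox_r, prox_g, grad_f are our own illustrative choices, not the authors' code; treat this as a sketch under those assumptions, not a reference implementation.

```python
import numpy as np

def ppg(prox_r, prox_g, grad_f, n, dim, alpha=0.1, iters=1000,
        stochastic=False, rng=None):
    """Sketch of PPG / S-PPG for
        minimize  r(x) + (1/n) * sum_i [ f_i(x) + g_i(x) ],
    with f_i differentiable and r, g_i proximable (all convex).

    prox_r(v, alpha)    -> prox of alpha * r evaluated at v
    prox_g(i, v, alpha) -> prox of alpha * g_i evaluated at v
    grad_f(i, x)        -> gradient of f_i at x
    """
    rng = rng or np.random.default_rng(0)
    z = np.zeros((n, dim))  # one auxiliary variable z_i per summand
    for _ in range(iters):
        # Half-step: prox of r at the average of the z_i.
        # Note the stepsize alpha is constant, independent of n.
        x_half = prox_r(z.mean(axis=0), alpha)
        # PPG updates every index; S-PPG updates one random index.
        idx = [rng.integers(n)] if stochastic else range(n)
        for i in idx:
            # Proximal-gradient-type step on the pair (f_i, g_i).
            v = 2 * x_half - z[i] - alpha * grad_f(i, x_half)
            x_i = prox_g(i, v, alpha)
            z[i] += x_i - x_half  # z_i update
    return prox_r(z.mean(axis=0), alpha)
```

For instance, a lasso-type problem fits this template with r(x) = lam * ||x||_1 (prox_r is soft-thresholding), g_i identically zero (prox_g returns v unchanged), and f_i(x) = 0.5 * (a_i @ x - b_i)**2. In the stochastic variant, the average of the z_i can be maintained incrementally so that each iteration touches only one summand.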
