Proximal-Proximal-Gradient Method

In this paper, we present the proximal-proximal-gradient method (PPG), a novel optimization method that is simple to implement and simple to parallelize. PPG generalizes the proximal-gradient method and ADMM and applies to minimization problems written as a sum of many differentiable and many non-differentiable convex functions; the non-differentiable functions may be coupled. We furthermore present a related stochastic variant, which we call stochastic PPG (S-PPG). S-PPG can be interpreted as a generalization of Finito and MISO to sums of many coupled non-differentiable convex functions. We present many applications that can benefit from PPG and S-PPG and prove convergence for both methods. A key strength of PPG and S-PPG, compared to existing methods, is their ability to directly handle a large sum of non-differentiable, non-separable functions with a constant stepsize that is independent of the number of functions. Such non-diminishing stepsizes allow the methods to be fast.
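To make the setup concrete, below is a minimal Python sketch of the iteration, assuming the three-operator-splitting form of the PPG update for problems of the form minimize r(x) + (1/n) * sum_i [f_i(x) + g_i(x)], with each f_i differentiable and r, g_i proximable: maintain one variable z_i per summand, form a half-step x by a prox of r at the average of the z_i, take a proximal-gradient-type step on each pair (f_i, g_i), then update z_i. The function name ppg and the callback names prox_r, prox_g, grad_f are our own illustrative choices, not the authors' code; treat this as a sketch under those assumptions, not a reference implementation.

```python
import numpy as np

def ppg(prox_r, prox_g, grad_f, n, dim, alpha=0.1, iters=1000,
        stochastic=False, rng=None):
    """Sketch of PPG / S-PPG for
        minimize  r(x) + (1/n) * sum_i [ f_i(x) + g_i(x) ],
    with f_i differentiable and r, g_i proximable (all convex).

    prox_r(v, alpha)    -> prox of alpha * r evaluated at v
    prox_g(i, v, alpha) -> prox of alpha * g_i evaluated at v
    grad_f(i, x)        -> gradient of f_i at x
    """
    rng = rng or np.random.default_rng(0)
    z = np.zeros((n, dim))  # one auxiliary variable z_i per summand
    for _ in range(iters):
        # Half-step: prox of r at the average of the z_i.
        # Note the stepsize alpha is constant, independent of n.
        x_half = prox_r(z.mean(axis=0), alpha)
        # PPG updates every index; S-PPG updates one random index.
        idx = [rng.integers(n)] if stochastic else range(n)
        for i in idx:
            # Proximal-gradient-type step on the pair (f_i, g_i).
            v = 2 * x_half - z[i] - alpha * grad_f(i, x_half)
            x_i = prox_g(i, v, alpha)
            z[i] += x_i - x_half  # z_i update
    return prox_r(z.mean(axis=0), alpha)
```

For instance, a lasso-type problem fits this template with r(x) = lam * ||x||_1 (prox_r is soft-thresholding), g_i identically zero (prox_g returns v unchanged), and f_i(x) = 0.5 * (a_i @ x - b_i)**2. In the stochastic variant, the average of the z_i can be maintained incrementally so that each iteration touches only one summand.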
