Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent

First-order methods play a central role in large-scale machine learning. Even though many variations exist, each suited to a particular problem, almost all such methods fundamentally rely on two types of algorithmic steps: gradient descent, which yields primal progress, and mirror descent, which yields dual progress. We observe that the performance of gradient descent and that of mirror descent are complementary, so that faster algorithms can be designed by linearly coupling the two. We show how to reconstruct Nesterov's accelerated gradient methods using linear coupling, which gives a cleaner interpretation than Nesterov's original proofs. We also illustrate the power of linear coupling by extending it to many other settings to which Nesterov's methods do not apply.
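
For concreteness, here is a minimal sketch of one possible linear-coupling loop in the Euclidean setting, where the mirror step reduces to a plain dual (subgradient-accumulation) step. The names grad_f, x0, L, and T, as well as the specific step-size schedule, are illustrative assumptions rather than the paper's exact parameters; the point is only the structure: each iteration linearly combines the gradient-descent iterate and the mirror-descent iterate.

    # A minimal sketch of linear coupling (Euclidean case), not the paper's
    # general mirror-map formulation. grad_f, L, and the step-size schedule
    # below are assumptions for illustration.
    import numpy as np

    def linear_coupling(grad_f, x0, L, T):
        """Couple a gradient step (primal progress) with a mirror step (dual progress)."""
        y = x0.copy()   # gradient-descent iterate
        z = x0.copy()   # mirror-descent iterate
        for k in range(1, T + 1):
            alpha = (k + 1) / (2.0 * L)   # mirror step size, growing over time (assumed schedule)
            tau = 1.0 / (alpha * L)       # coupling weight, here 2/(k+1) in [0, 1]
            x = tau * z + (1 - tau) * y   # linearly couple the two iterates
            g = grad_f(x)
            y = x - g / L                 # gradient-descent step from x
            z = z - alpha * g             # Euclidean mirror-descent step from z
        return y

With a non-Euclidean distance-generating function, the last update would instead be a Bregman projection; the Euclidean choice is used here only to keep the sketch short.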
