Asynchronous Proximal Stochastic Gradient Algorithm for Composition Optimization Problems

In machine learning research, many emerging applications can be (re)formulated as composition optimization problems with a nonsmooth regularization penalty. To solve such problems, the traditional stochastic gradient descent (SGD) algorithm and its variants either converge slowly or are computationally expensive. Several stochastic composition gradient algorithms have recently been proposed; however, these methods remain inefficient and do not scale to large composition optimization instances. To address these challenges, we propose an asynchronous parallel algorithm, named Async-ProxSCVR, which effectively combines an asynchronous parallel implementation with a variance reduction method. We prove that the algorithm attains the fastest known convergence rates in both the strongly convex and general nonconvex cases. Furthermore, we analyze the query complexity of the proposed algorithm and prove that linear speedup is achievable as the number of processors increases. Finally, we evaluate Async-ProxSCVR on two representative composition optimization problems: value function evaluation in reinforcement learning and sparse mean-variance optimization. Experimental results show that the algorithm achieves significant speedups and is much faster than existing methods.
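
For concreteness, the finite-sum composition problem with a nonsmooth penalty that this line of work typically studies can be written as (notation below is mine, not taken from the paper):

\min_{x} \; \frac{1}{n}\sum_{i=1}^{n} f_i\!\Big(\frac{1}{m}\sum_{j=1}^{m} g_j(x)\Big) \;+\; h(x),

where the inner mapping g and outer functions f_i are smooth and the regularizer h is nonsmooth and handled through its proximal operator. The following is a minimal, serial Python sketch of one SVRG-style variance-reduced proximal epoch for this problem; the function names and structure are my own illustration, and the asynchronous, lock-free updates that are the paper's actual contribution are omitted, so this should not be read as the authors' Async-ProxSCVR implementation.

```python
# Serial sketch of a proximal, SVRG-style variance-reduced epoch for
# finite-sum composition optimization:
#   min_x  (1/n) sum_i f_i( (1/m) sum_j g_j(x) )  +  h(x),
# with h nonsmooth (here: an l1 penalty, handled by soft-thresholding).
# Illustrative reconstruction only; not the authors' Async-ProxSCVR code.

import numpy as np

def prox_l1(v, lam):
    """Proximal operator of lam * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_svrg_composition_epoch(x_tilde, g_funcs, g_jacs, f_grads,
                                step, lam, inner_iters, rng):
    """
    One outer epoch: take a snapshot at x_tilde, precompute full quantities,
    then run variance-reduced inner proximal steps.
      g_funcs[j](x) -> value of the inner mapping g_j at x
      g_jacs[j](x)  -> Jacobian of g_j at x
      f_grads[i](w) -> gradient of the outer function f_i at w
    """
    m, n = len(g_funcs), len(f_grads)

    # Snapshot quantities, computed once per epoch as in SVRG-type methods.
    G_tilde = sum(g(x_tilde) for g in g_funcs) / m
    J_bar = sum(J(x_tilde) for J in g_jacs) / m
    f_bar = sum(df(G_tilde) for df in f_grads) / n
    full_grad = J_bar.T @ f_bar            # full gradient of the smooth part at x_tilde

    x = x_tilde.copy()
    for _ in range(inner_iters):
        j = rng.integers(m)
        i = rng.integers(n)
        # Variance-reduced estimate of the inner mapping value G(x).
        G_hat = g_funcs[j](x) - g_funcs[j](x_tilde) + G_tilde
        # Variance-reduced estimate of the composition gradient.
        grad_hat = (g_jacs[j](x).T @ f_grads[i](G_hat)
                    - g_jacs[j](x_tilde).T @ f_grads[i](G_tilde)
                    + full_grad)
        # Proximal update handles the nonsmooth penalty h = lam * ||.||_1.
        x = prox_l1(x - step * grad_hat, step * lam)
    return x
```

In the asynchronous setting described in the abstract, multiple processors would execute the inner loop concurrently on a shared iterate with possibly stale reads; the serial loop above only shows the variance-reduced estimator and the proximal step.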
