Asynchronous Stochastic Proximal Optimization Algorithms with Variance Reduction

Regularized empirical risk minimization (R-ERM) is an important branch of machine learning, since it constrains the capacity of the hypothesis space and guarantees the generalization ability of the learning algorithm. Two classic proximal optimization algorithms, proximal stochastic gradient descent (ProxSGD) and proximal stochastic coordinate descent (ProxSCD), have been widely used to solve the R-ERM problem. Recently, the variance reduction technique was proposed to improve ProxSGD and ProxSCD, and the resulting ProxSVRG and ProxSVRCD enjoy better convergence rates. These proximal algorithms with variance reduction have also achieved great success in small- and moderate-scale applications. However, to solve large-scale R-ERM problems and make a broader practical impact, parallel versions of these algorithms are sorely needed. In this paper, we propose asynchronous ProxSVRG (Async-ProxSVRG) and asynchronous ProxSVRCD (Async-ProxSVRCD), and prove that Async-ProxSVRG achieves near-linear speedup when the training data are sparse, while Async-ProxSVRCD achieves near-linear speedup regardless of data sparsity, as long as the number of block partitions is appropriately set. We conducted experiments on a regularized logistic regression task; the results verify our theoretical findings and demonstrate the practical efficiency of the asynchronous stochastic proximal algorithms with variance reduction.
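
To make the underlying update concrete, below is a minimal serial Python sketch of a ProxSVRG-style iteration on an l1-regularized logistic regression problem: an outer loop takes a full-gradient snapshot at a reference point, and an inner loop applies variance-reduced stochastic gradients followed by a proximal (soft-thresholding) step. This is an illustrative sketch under assumed names, step sizes, and data, not the paper's implementation; the asynchronous variants studied in the paper apply updates of this form from multiple workers on shared parameters.

```python
# Minimal serial sketch of a ProxSVRG-style update (illustrative only).
import numpy as np

def prox_l1(x, threshold):
    """Soft-thresholding: proximal operator of threshold * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)

def logistic_grad(w, A, b):
    """Gradient of the average logistic loss over rows of A, labels b in {-1, +1}."""
    z = -b * (A @ w)
    return A.T @ (-b / (1.0 + np.exp(-z))) / len(b)

def prox_svrg(A, b, lam=1e-3, eta=0.1, epochs=20, inner_steps=None):
    n, d = A.shape
    m = inner_steps or 2 * n              # inner-loop length (a common heuristic)
    w_ref = np.zeros(d)
    for _ in range(epochs):
        full_grad = logistic_grad(w_ref, A, b)   # full-gradient snapshot at the reference point
        w = w_ref.copy()
        for _ in range(m):
            i = np.random.randint(n)
            a_i, b_i = A[i:i + 1], b[i:i + 1]
            # Variance-reduced stochastic gradient.
            v = logistic_grad(w, a_i, b_i) - logistic_grad(w_ref, a_i, b_i) + full_grad
            # Proximal step handles the nonsmooth l1 regularizer.
            w = prox_l1(w - eta * v, eta * lam)
        w_ref = w
    return w_ref

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 20))
    b = np.sign(A @ rng.standard_normal(20) + 0.1 * rng.standard_normal(500))
    w = prox_svrg(A, b)
    print("nonzero coordinates:", np.count_nonzero(np.abs(w) > 1e-8))
```

In the asynchronous setting, several threads would run the inner loop concurrently on shared parameters without locking, so each update may be computed from a slightly stale copy of w; the sparsity and block-partition conditions in the abstract govern how much such staleness costs.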
