SignProx: One-bit Proximal Algorithm for Nonconvex Stochastic Optimization

Stochastic gradient descent (SGD) is one of the most widely used optimization methods for parallel and distributed processing of large datasets. A key limitation of distributed SGD is the need to regularly communicate gradients between computation nodes. To reduce this communication bottleneck, recent work has considered a one-bit variant of SGD, in which only the sign of each gradient element is used for optimization. In this paper, we extend this idea by proposing a stochastic variant of the proximal-gradient method that also uses one bit per update element. We prove convergence of the method for nonconvex optimization under a set of explicit assumptions. Our results indicate that the compressed method can match the convergence rate of the uncompressed one, making the proposed method potentially appealing for distributed processing of large datasets.
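
To make the idea of a one-bit proximal update concrete, the following is a minimal sketch and not the algorithm analyzed in the paper. It assumes an l1 regularizer (whose proximal operator is soft-thresholding), a least-squares data term, and an update of the form x - step * sign(x - prox(x - step * g)), where g is a mini-batch gradient. The function names, the regularizer, and the exact update rule are illustrative assumptions.

```python
import random


def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to a scalar (assumed regularizer)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def sign(v):
    """Return -1, 0, or +1. A strict one-bit encoder would need a convention for 0."""
    return (v > 0) - (v < 0)


def minibatch_grad(x, data, batch_size, rng):
    """Stochastic gradient of 0.5 * mean((a.x - b)^2) over a random mini-batch."""
    batch = rng.sample(data, batch_size)
    g = [0.0] * len(x)
    for a, b in batch:
        r = sum(ai * xi for ai, xi in zip(a, x)) - b
        for j, aj in enumerate(a):
            g[j] += r * aj / batch_size
    return g


def sign_prox_step(x, grad, step, lam):
    """One illustrative sign-based proximal-gradient step (hypothetical form)."""
    out = []
    for xi, gi in zip(x, grad):
        # Full-precision proximal-gradient update element.
        u = xi - soft_threshold(xi - step * gi, step * lam)
        # Keep only the sign of that element, then move by a fixed step.
        out.append(xi - step * sign(u))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data (assumption): b = 2 * a[0], so the second coefficient is irrelevant.
    data = []
    for _ in range(50):
        a = [rng.gauss(0, 1), rng.gauss(0, 1)]
        data.append((a, 2.0 * a[0]))
    x = [0.0, 0.0]
    for _ in range(200):
        g = minibatch_grad(x, data, batch_size=10, rng=rng)
        x = sign_prox_step(x, g, step=0.01, lam=0.1)
    print(x)
```

In a distributed setting, only the per-element signs would need to be communicated between nodes, which is where the communication savings described above would come from. With a fixed step size, this sketch moves each coordinate by exactly `step` per iteration, so a practical variant would likely use a decaying step.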
