Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems

Stochastic gradient descent is a canonical tool for solving stochastic optimization problems and forms the bedrock of modern machine learning and statistics. In this work, we seek to balance two competing facts: an attenuating step-size is required for exact asymptotic convergence, whereas a constant step-size learns faster in finite time, but only up to an error neighborhood of the solution. To do so, rather than fixing the mini-batch size and the step-size at the outset, we propose a strategy that allows these parameters to evolve adaptively. Specifically, the batch-size is chosen as a piecewise-constant increasing sequence, where each increase is triggered when a suitable error criterion is satisfied, and the step-size is then selected as the one yielding the fastest convergence. The overall algorithm, the two-scale adaptive (TSA) scheme, is developed for both convex and non-convex stochastic optimization problems. It inherits the exact asymptotic convergence of the stochastic gradient method; more importantly, it provably achieves the optimal rate of error decrease while reducing the overall computational cost. Experimentally, we observe that TSA attains a favorable tradeoff relative to standard SGD with a fixed mini-batch and step-size, as well as to schemes that only increase the batch-size or only decrease the step-size.
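
The adaptive batch-size mechanism described above can be illustrated with a minimal sketch. The Python snippet below is an illustration under assumptions of our own, not the exact TSA scheme of the paper: it uses a mini-batch gradient-norm test as the error criterion, a fixed geometric growth factor for the batch, and a constant step-size (whereas TSA also adapts the step-size). The function name `adaptive_batch_sgd` and parameters such as `growth`, `tol0`, and `shrink` are hypothetical.

```python
# Minimal sketch (not the authors' exact TSA algorithm): run mini-batch SGD
# with a constant step-size, and whenever an assumed error criterion is met
# (small mini-batch gradient norm), enlarge the batch to reduce gradient
# variance and tighten the criterion before continuing.
import numpy as np

def adaptive_batch_sgd(grad_fn, x0, n_samples, step_size=0.1,
                       batch0=8, growth=2, tol0=1.0, shrink=0.5,
                       max_iters=1000, rng=None):
    """grad_fn(x, idx) returns the mini-batch stochastic gradient at x over
    the sample indices idx; all other parameters are illustrative choices."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    batch, tol = batch0, tol0
    for _ in range(max_iters):
        idx = rng.choice(n_samples, size=min(batch, n_samples), replace=False)
        g = grad_fn(x, idx)
        x -= step_size * g
        # Assumed error criterion: once the mini-batch gradient norm is small,
        # increase the batch size (piecewise-constant growth) and shrink tol.
        if np.linalg.norm(g) <= tol:
            batch = min(growth * batch, n_samples)
            tol *= shrink
    return x

# Usage on a toy least-squares problem f(x) = (1/2n) ||A x - b||^2.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 10))
    x_true = rng.standard_normal(10)
    b = A @ x_true + 0.1 * rng.standard_normal(500)
    grad = lambda x, idx: A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)
    x_hat = adaptive_batch_sgd(grad, np.zeros(10), n_samples=500)
    print("estimation error:", np.linalg.norm(x_hat - x_true))
```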
