Mini-batch algorithms with online step size

Abstract: Mini-batch algorithms have been proposed to speed up stochastic optimization methods, and good results for them have been reported previously. A major issue with mini-batch algorithms is how to obtain a suitable step size in a timely and convenient way while the algorithm is running. In practice, mini-batch algorithms typically employ either a diminishing step size or a hand-tuned fixed step size, both of which are time consuming to set. To solve this problem, we propose using a hypergradient to compute an online step size (OSS) for mini-batch algorithms. Specifically, we incorporate the online step size into an advanced mini-batch algorithm, mini-batch nonconvex stochastic variance reduced gradient (MSVRG), thereby generating a new method, MSVRG-OSS. MSVRG-OSS computes its step size from mini-batch samples, requires little additional computation, and needs to store only one extra copy of the original gradient in memory. We prove that MSVRG-OSS converges linearly in expectation and analyze its complexity. We present numerical results on problems arising in machine learning that indicate the proposed method shows great promise. We also show that, with moderately large batch sizes, MSVRG-OSS is insensitive to its initial parameters, which are the key factors controlling the performance of the algorithm.
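To make the idea concrete, the following is a minimal sketch (not the authors' code) of how a hypergradient can adapt the step size online inside an SVRG-style mini-batch loop, in the spirit of MSVRG-OSS. The function grad_fn, the hyper-step beta, the initial step size alpha0, and the loop lengths are illustrative assumptions; only one extra gradient copy is kept, matching the memory footprint described in the abstract.

```python
import numpy as np

def msvrg_oss_sketch(grad_fn, theta, data, alpha0=0.05, beta=1e-4,
                     batch_size=32, outer_iters=20, inner_iters=50, rng=None):
    """SVRG-style outer/inner loops with a hypergradient-adapted step size.

    grad_fn(theta, samples) must return the average gradient over `samples`.
    """
    rng = rng or np.random.default_rng(0)
    n = len(data)
    alpha = alpha0
    prev_grad = np.zeros_like(theta)            # the single extra gradient copy
    for _ in range(outer_iters):
        snapshot = theta.copy()
        full_grad = grad_fn(snapshot, data)     # full gradient at the snapshot
        for _ in range(inner_iters):
            idx = rng.choice(n, batch_size, replace=False)
            batch = data[idx]
            # Variance-reduced mini-batch gradient (SVRG correction).
            vr_grad = grad_fn(theta, batch) - grad_fn(snapshot, batch) + full_grad
            # Online step size via hypergradient: grow alpha when successive
            # stochastic gradients agree, shrink it when they disagree.
            alpha += beta * np.dot(vr_grad, prev_grad)
            theta -= alpha * vr_grad            # ordinary parameter update
            prev_grad = vr_grad
    return theta, alpha
```

In this sketch the only per-iteration overhead beyond MSVRG is one inner product and the stored previous gradient, which is why the step-size adaptation costs little extra computation or memory.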
