Analysis of Gradient Descent Methods With Nondiminishing Bounded Errors

The main aim of this paper is to provide an analysis of gradient descent (GD) algorithms with gradient errors that do not necessarily vanish asymptotically. In particular, sufficient conditions are presented for both stability (almost sure boundedness of the iterates) and convergence of GD with bounded, possibly nondiminishing, gradient errors. In addition to being stable, such an algorithm is shown to converge to a small neighborhood of the minimum set, whose size depends on the magnitude of the gradient errors. It is worth noting that the main result of this paper can also be used to show that GD with asymptotically vanishing errors converges to the minimum set itself. The results presented herein are more general than previous results, and to the best of our knowledge our analysis of GD with errors is new to the literature. Our work extends the contributions of Mangasarian and Solodov [4], Bertsekas and Tsitsiklis [10], and Tadić and Doucet [15]. Using our framework, a simple yet effective implementation of GD based on simultaneous perturbation stochastic approximation (SPSA), with constant sensitivity parameters, is presented. Another important improvement over many previous results is that no additional restrictions are imposed on the step sizes; in machine learning applications, where step sizes play the role of learning rates, our assumptions, unlike those of other papers, do not affect these rates. Finally, we present experimental results that validate our theory.
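To make the constant-sensitivity construction concrete, below is a minimal sketch (not the paper's reference implementation) of GD driven by SPSA gradient estimates in which the sensitivity parameter c is held constant rather than decayed, so the gradient error is bounded but nondiminishing. The quadratic test objective, the step-size schedule a_n = 1/n, and all function names are illustrative assumptions.

```python
import numpy as np

def spsa_gradient(f, x, c):
    """One-sample SPSA gradient estimate with a constant sensitivity
    parameter c, so the estimation bias is bounded but nondiminishing."""
    delta = np.random.choice([-1.0, 1.0], size=x.shape)  # Rademacher perturbation
    return (f(x + c * delta) - f(x - c * delta)) / (2.0 * c * delta)

def gd_with_spsa(f, x0, steps=10_000, c=0.1):
    """GD driven by SPSA estimates: diminishing step sizes a_n, constant c.
    The iterates settle in a neighborhood of the minimum set whose size
    depends on the bound on the gradient errors."""
    x = np.asarray(x0, dtype=float)
    for n in range(1, steps + 1):
        a_n = 1.0 / n  # standard step sizes: sum a_n = inf, sum a_n^2 < inf
        x -= a_n * spsa_gradient(f, x, c)
    return x

# Illustrative objective: the minimum set is the single point (1, -2).
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
print(gd_with_spsa(f, x0=[5.0, 5.0]))  # near [1, -2], up to a small bias neighborhood
```

Because c is never decreased, the SPSA estimate carries a persistent bias, which is exactly the bounded nondiminishing gradient error covered by the stability and convergence results described above.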

[1] J. Kiefer and J. Wolfowitz, "Stochastic estimation of the maximum of a regression function," Ann. Math. Statist., 1952.

[2] J.-P. Aubin and A. Cellina, Differential Inclusions: Set-Valued Maps and Viability Theory, Springer, 1984.

[3] J. C. Spall, "Multivariate stochastic approximation using a simultaneous perturbation gradient approximation," IEEE Trans. Autom. Control, 1992.

[4] O. L. Mangasarian and M. V. Solodov, "Serial and parallel backpropagation convergence via nonmonotone perturbed minimization," Optim. Methods Softw., 1994.

[5] M. Hurley, "Chain recurrence, semiflows, and gradients," J. Dyn. Differ. Equ., 1995.

[6] M. Benaïm, "A dynamical system approach to stochastic approximations," SIAM J. Control Optim., 1996.

[7] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems (NIPS), 1999.

[8] J. C. Spall, "Adaptive stochastic approximation by the simultaneous perturbation method," IEEE Trans. Autom. Control, 2000.

[9] V. S. Borkar and S. P. Meyn, "The O.D.E. method for convergence of stochastic approximation and reinforcement learning," SIAM J. Control Optim., 2000.

[10] D. P. Bertsekas and J. N. Tsitsiklis, "Gradient convergence in gradient methods with errors," SIAM J. Optim., 2000.

[11] M. Benaïm, J. Hofbauer, and S. Sorin, "Stochastic approximations and differential inclusions," SIAM J. Control Optim., 2005.

[12] J. C. Spall, Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley, 2003.

[13] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge University Press, 2008.

[14] S. Haykin, Neural Networks and Learning Machines, Pearson, 2010.

[15] V. B. Tadić and A. Doucet, "Asymptotic bias of stochastic gradient search," in Proc. IEEE Conf. on Decision and Control and European Control Conf. (CDC-ECC), 2011.

[16] M. Benaïm, J. Hofbauer, and S. Sorin, "Perturbations of set-valued dynamical systems, with applications to game theory," Dyn. Games Appl., 2012.