Open Problem: Do Good Algorithms Necessarily Query Bad Points?

Folklore results in the theory of Stochastic Approximation indicate the (minimax) optimality of Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951) with polynomially decaying step sizes and iterate averaging (Ruppert, 1988; Polyak and Juditsky, 1992) for classes of stochastic convex optimization problems. Building on these folklore results and some recent developments, this manuscript considers a more subtle question: does any algorithm necessarily (information theoretically) have to query iterates that are sub-optimal infinitely often?
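For concreteness, here is a minimal Python sketch (illustrative, not from the manuscript) of the classical recipe the abstract refers to: SGD with polynomially decaying step sizes eta_t = eta0 / t^alpha combined with Polyak-Ruppert iterate averaging. The quadratic toy objective f(w) = 0.5 * ||w||^2, the constants eta0 and alpha, and the unit Gaussian gradient noise are all assumptions made for the demo; the open problem asks whether the individual queried iterates w_t (as opposed to the returned average) must remain sub-optimal infinitely often.

```python
# Minimal sketch of SGD with polynomially decaying step sizes and
# Polyak-Ruppert iterate averaging. Objective, constants, and noise
# model are illustrative assumptions, not the paper's setting.
import numpy as np

rng = np.random.default_rng(0)

def stochastic_grad(w):
    # Noisy gradient of f(w) = 0.5 * ||w||^2 (minimizer w* = 0),
    # with additive unit Gaussian noise.
    return w + rng.normal(scale=1.0, size=w.shape)

def sgd_with_averaging(w0, n_steps, eta0=1.0, alpha=0.5):
    w = w0.copy()
    avg = w0.copy()
    for t in range(1, n_steps + 1):
        w -= (eta0 / t**alpha) * stochastic_grad(w)  # decaying step size eta0 / t^alpha
        avg += (w - avg) / t                         # running average of w_1, ..., w_t
    # The returned *average* enjoys the folklore optimal rate; the queried
    # *iterates* w_t may lag behind -- the gap the open problem asks about.
    return w, avg

last, averaged = sgd_with_averaging(np.ones(5), n_steps=10_000)
print("||last iterate - w*|| :", np.linalg.norm(last))
print("||averaged    - w*|| :", np.linalg.norm(averaged))
```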

[1] Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the Last Iterate of SGD Information Theoretically Optimal. COLT, 2019.

[2] Maxim Raginsky and Alexander Rakhlin. Information-Based Complexity, Feedback and Dynamics in Convex Programming. IEEE Transactions on Information Theory, 2010.

[3] Arkadi Nemirovski and David B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

[4] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 1992.

[5] Prateek Jain et al. Parallelizing Stochastic Approximation Through Mini-Batching and Tail-Averaging. arXiv, 2016.

[6] Rong Ge, Sham M. Kakade, Rahul Kidambi, and Praneeth Netrapalli. The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure. NeurIPS, 2019.

[7] E. L. Lehmann. Theory of Point Estimation. 1983.

[8] Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). NIPS, 2013.

[9] Prateek Jain et al. Accelerating Stochastic Gradient Descent. arXiv, 2017.

[10] Ohad Shamir and Tong Zhang. Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes. ICML, 2013.

[11] David Ruppert. Efficient Estimations from a Slowly Convergent Robbins-Monro Process. Technical Report, Cornell University, 1988.

[12] Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Information-Theoretic Lower Bounds on the Oracle Complexity of Stochastic Convex Optimization. IEEE Transactions on Information Theory, 2010.

[13] Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 1951.

[14] Zeyuan Allen-Zhu. How To Make the Gradients Small Stochastically. NeurIPS, 2018.

[15] Nicholas J. A. Harvey et al. Tight Analyses for Non-Smooth Stochastic Gradient Descent. COLT, 2019.