Online convex optimization in the bandit setting: gradient descent without a gradient

We study a general online convex optimization problem. We have a convex set S and an unknown sequence of cost functions c_1, c_2, ..., and in each period, we choose a feasible point x_t in S and learn the cost c_t(x_t). If the function c_t is also revealed after each period then, as Zinkevich shows in [25], gradient descent can be used on these functions to get regret bounds of O(√n). That is, after n rounds, the total cost incurred will be O(√n) more than the cost of the best single feasible decision chosen with the benefit of hindsight, min_x Σ_t c_t(x).

We extend this to the "bandit" setting, where, in each period, only the cost c_t(x_t) is revealed, and bound the expected regret by O(n^{3/4}). Our approach uses a simple approximation of the gradient that is computed by evaluating c_t at a single (random) point. We show that this biased estimate is sufficient to approximate gradient descent on the sequence of functions. In other words, it is possible to use gradient descent without seeing anything more than the value of the functions at a single point. The guarantees hold even in the most general case: online against an adaptive adversary.

For the online linear optimization problem [15], algorithms with low regret in the bandit setting have recently been given against oblivious [1] and adaptive adversaries [19]. In contrast to these algorithms, which distinguish between explicit explore and exploit periods, our algorithm can be interpreted as doing a small amount of exploration in each period.
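To make the single-point gradient estimate concrete, the following is a minimal Python sketch of the scheme described above: each round the learner plays one randomly perturbed point y_t = x_t + δu_t, observes only the scalar cost c_t(y_t), and takes an ordinary gradient step along g_t = (d/δ) c_t(y_t) u_t. The step size eta, the perturbation radius delta, and the choice of a Euclidean ball as the feasible set S are illustrative assumptions, not the paper's tuned parameters.

import numpy as np

def bandit_gradient_descent(cost_fns, dim, delta=0.25, eta=0.01, radius=1.0, seed=0):
    # Sketch of one-point bandit gradient descent. Assumes S is the Euclidean
    # ball of the given radius (the paper handles general convex bodies);
    # delta and eta are illustrative constants, not the paper's tuned values.
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)                 # start at the center of S
    total_cost = 0.0
    for c in cost_fns:                # one unknown convex cost per period
        u = rng.standard_normal(dim)
        u /= np.linalg.norm(u)        # uniform random direction on the unit sphere
        y = x + delta * u             # the single point actually played
        cost = c(y)                   # the only feedback the algorithm receives
        total_cost += cost
        g = (dim / delta) * cost * u  # one-point estimate of the gradient
        x = x - eta * g               # gradient step on the estimate
        nx = np.linalg.norm(x)
        if nx > radius - delta:       # project onto a slightly shrunken ball
            x = x * (radius - delta) / nx  # so x + delta*u stays inside S
    return total_cost

# Example with drifting quadratic costs (hypothetical test data):
costs = [lambda y, t=t: float(np.sum((y - 0.5 * np.sin(t / 10.0)) ** 2))
         for t in range(1000)]
print(bandit_gradient_descent(costs, dim=5))

The reason this works is that, in expectation, g is the gradient of the smoothed cost ĉ(x) = E_v[c(x + δv)] with v uniform over the unit ball, so the biased single-evaluation estimate can stand in for the true gradient; the O(n^{3/4}) regret bound comes from tuning delta and eta as functions of the horizon n.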

[1] N. Metropolis, et al. Equation of State Calculations by Fast Computing Machines, 1953, Resonance.

[2] L. G. Khachiyan. A polynomial algorithm in linear programming, 1979.

[3] L. Khachiyan. Polynomial algorithms in linear programming, 1980.

[4] C. D. Gelatt, et al. Optimization by Simulated Annealing, 1983, Science.

[5] V. Milman, et al. Isotropic position and inertia ellipsoids and zonoids of the unit ball of a normed n-dimensional space, 1989.

[6] Nicolò Cesa-Bianchi, et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem, 1995, Proceedings of IEEE 36th Annual Foundations of Computer Science.

[7] T. Cover. Universal Portfolios, 1996.

[8] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.

[9] Yoram Singer, et al. On-Line Portfolio Selection Using Multiplicative Updates, 1998, ICML.

[10] Miklós Simonovits, et al. Random walks and an O*(n^5) volume algorithm for convex bodies, 1997, Random Struct. Algorithms.

[11] James C. Spall, et al. A one-measurement form of simultaneous perturbation stochastic approximation, 1997, Autom.

[12] M. Simonovits, et al. Random walks and an O*(n^5) volume algorithm for convex bodies, 1997.

[13] Manfred K. Warmuth, et al. Exponentiated Gradient Versus Gradient Descent for Linear Predictors, 1997, Inf. Comput.

[14] A. Frieze, et al. Log-Sobolev inequalities and sampling from log-concave distributions, 1999.

[15] Thomas de Quincey. [C], 2000, The Works of Thomas De Quincey, Vol. 1: Writings, 1799–1820.

[16] Santosh S. Vempala, et al. Efficient algorithms for universal portfolios, 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[17] O. Granichin. Randomized Algorithms for Stochastic Approximation under Arbitrary Disturbances, 2002.

[18] Manfred K. Warmuth, et al. Path Kernels and Multiplicative Updates, 2002, J. Mach. Learn. Res.

[19] James C. Spall, et al. Introduction to stochastic search and optimization: estimation, simulation, and control, 2003, Wiley-Interscience series in discrete mathematics and optimization.

[20] Santosh S. Vempala, et al. Simulated annealing in convex bodies and an O*(n^4) volume algorithm, 2003, 44th Annual IEEE Symposium on Foundations of Computer Science, Proceedings.

[21] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent, 2003, ICML.

[22] Santosh S. Vempala, et al. Solving convex programs by random walks, 2004, JACM.

[23] Baruch Awerbuch, et al. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches, 2004, STOC '04.

[24] Tim Hesterberg, et al. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, 2004, Technometrics.

[25] Avrim Blum, et al. Online Geometric Optimization in the Bandit Setting Against an Adaptive Adversary, 2004, COLT.

[26] Robert D. Kleinberg. Nearly Tight Bounds for the Continuum-Armed Bandit Problem, 2004, NIPS.

[27] Santosh S. Vempala, et al. Simulated annealing in convex bodies and an O*(n^4) volume algorithm, 2006, J. Comput. Syst. Sci.

[28] James C. Spall, et al. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control (Spall, J.C.), 2007.