Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of NEWTRON is O(log T) when α is a constant that does not vary with the horizon T, and at most O(T^{2/3}) if α is allowed to increase to infinity with T. For α = O(log T), the regret is bounded by O(√T), thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show that it performs well in experiments, even when α is a small constant.
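The abstract names the online Newton method of [HAK07] as the basis for NEWTRON. As a rough illustration only, the following is a minimal sketch of one generic, unconstrained online Newton step in plain Python. It is not the NEWTRON update itself, which has to work from bandit feedback and is specified in the paper. The function name `online_newton_step`, the `step` parameter, and the omission of the projection step are all assumptions made for this sketch.

```python
def online_newton_step(w, A_inv, grad, step):
    """One generic, unconstrained online Newton step (illustrative sketch).

    w      -- current weight vector (list of floats)
    A_inv  -- inverse of the running matrix A = eps*I + sum of g g^T;
              assumed symmetric, which holds if it starts as a scaled identity
    grad   -- gradient of the current loss (or an estimate of it) at w
    step   -- step-size multiplier (hypothetical tuning parameter)

    Returns the updated (w, A_inv). The projection onto a feasible set
    used by the full method is deliberately left out.
    """
    n = len(w)
    # Sherman-Morrison rank-one update:
    # (A + g g^T)^-1 = A^-1 - (A^-1 g)(A^-1 g)^T / (1 + g^T A^-1 g)
    Ag = [sum(A_inv[i][j] * grad[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(grad[i] * Ag[i] for i in range(n))
    new_A_inv = [
        [A_inv[i][j] - Ag[i] * Ag[j] / denom for j in range(n)]
        for i in range(n)
    ]
    # Move against the gradient, preconditioned by the updated inverse.
    direction = [
        sum(new_A_inv[i][j] * grad[j] for j in range(n)) for i in range(n)
    ]
    new_w = [w[i] - step * direction[i] for i in range(n)]
    return new_w, new_A_inv
```

Maintaining the inverse with a rank-one update keeps each round at quadratic cost in the dimension, instead of re-inverting the matrix from scratch every round.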

[1] Jacob D. Abernethy, et al. An Efficient Bandit Algorithm for sqrt(T) Regret in Online Multiclass Prediction?, 2009, COLT.

[2] Thomas P. Hayes, et al. The Price of Bandit Information for Online Optimization, 2007, NIPS.

[3] Peter Auer, et al. The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput.

[4] Adam Tauman Kalai, et al. Online convex optimization in the bandit setting: gradient descent without a gradient, 2004, SODA '05.

[5] P. Bartlett, et al. Closing the gap between bandit and full-information online optimization: high-probability regret bound, 2007.

[6] Baruch Awerbuch, et al. Online linear optimization and adaptive routing, 2008, J. Comput. Syst. Sci.

[7] J. Abernethy, et al. An Efficient Bandit Algorithm for √T-Regret in Online Multiclass Prediction?, 2009.

[8] Thomas P. Hayes, et al. Robbing the bandit: less regret in online geometric optimization against an adaptive adversary, 2006, SODA '06.

[9] Elad Hazan, et al. Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization, 2008, COLT.

[10] Charles R. Johnson, et al. Topics in Matrix Analysis, 1991.

[11] Ambuj Tewari, et al. Efficient bandit algorithms for online multiclass prediction, 2008, ICML '08.

[12] Koby Crammer, et al. Multiclass classification with bandit feedback using adaptive regularization, 2012, Machine Learning.

[13] John Langford, et al. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information, 2007, NIPS.

[14] Avrim Blum, et al. Online Geometric Optimization in the Bandit Setting Against an Adaptive Adversary, 2004, COLT.

[15] Elad Hazan, et al. Logarithmic regret algorithms for online convex optimization, 2006, Machine Learning.