A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free

We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. Specifically, our algorithm achieves dynamic regret $\mathcal{O}(\min\{\sqrt{ST}, \Delta^{\frac{1}{3}}T^{\frac{2}{3}}\})$ for a contextual bandit problem with $T$ rounds, $S$ switches, and $\Delta$ total variation in the data distributions. Importantly, our algorithm is adaptive: it does not need to know $S$ or $\Delta$ ahead of time, and it can be implemented efficiently assuming access to an empirical risk minimization (ERM) oracle. Our results strictly improve the $\mathcal{O}(\min\{S^{\frac{1}{4}}T^{\frac{3}{4}}, \Delta^{\frac{1}{5}}T^{\frac{4}{5}}\})$ bound of Luo et al. (2018), and greatly generalize and improve the $\mathcal{O}(\sqrt{ST})$ result of Auer et al. (2018), which holds only for the two-armed bandit problem without contextual information. The key novelty of our algorithm is the introduction of replay phases, in which the algorithm temporarily reverts to a policy from an earlier period in order to detect non-stationarity while maintaining a good balance between exploration and exploitation.
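
To make the replay idea concrete, here is a minimal, hypothetical sketch in Python. It is not the paper's actual algorithm: the ERM oracle, the policy distributions, and the randomized replay schedule are all omitted, and the names (ReplayDetector, env.pull, policy.choose), the drift threshold, and the toy environment are illustrative assumptions. It only shows how re-running a previously used policy and comparing its current reward with its past reward can expose a change in the environment.

import random

# Hypothetical sketch: detect non-stationarity by replaying an earlier policy.
# NOT the paper's algorithm; the interfaces, schedule, and threshold are assumptions.
class ReplayDetector:
    def __init__(self, threshold=0.2):
        self.threshold = threshold          # illustrative drift threshold
        self.snapshots = []                 # list of (round, policy, past average reward)

    def record(self, t, policy, avg_reward):
        # Snapshot a policy together with the average reward it earned around round t.
        self.snapshots.append((t, policy, avg_reward))

    def replay(self, env, length=50):
        # Temporarily act with a randomly chosen earlier policy for `length` rounds
        # and flag a change if its reward has dropped noticeably since it was recorded.
        if not self.snapshots:
            return False
        _, policy, past_avg = random.choice(self.snapshots)
        rewards = [env.pull(policy.choose(env.context())) for _ in range(length)]
        return past_avg - sum(rewards) / length > self.threshold

# Illustrative usage with toy stand-ins for the environment and a policy.
class ToyEnv:
    def __init__(self, bias):
        self.bias = bias                    # current success probability of arm 1
    def context(self):
        return None                         # contexts are ignored in this toy example
    def pull(self, arm):
        return random.random() < (self.bias if arm == 1 else 0.5)

class ToyPolicy:
    def choose(self, context):
        return 1                            # always play arm 1

detector = ReplayDetector()
detector.record(t=100, policy=ToyPolicy(), avg_reward=0.9)
changed_env = ToyEnv(bias=0.3)              # arm 1's reward has since dropped
print(detector.replay(changed_env))         # likely prints True

In a full algorithm, such replay phases would be interleaved randomly with the usual exploration-exploitation updates, and a detected change would trigger a restart of the learner; the actual statistical tests and schedule in the paper are more involved.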

[1] Haipeng Luo et al. Achieving All with No Parameters: AdaNormalHedge, 2015, COLT.

[2] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent, 2003, ICML.

[3] Akshay Krishnamurthy et al. Efficient Algorithms for Adversarial Contextual Learning, 2016, ICML.

[4] Haipeng Luo et al. Improved Regret Bounds for Oracle-Based Adversarial Contextual Bandits, 2016, NIPS.

[5] Rong Jin et al. Dynamic Regret of Strongly Adaptive Methods, 2017, ICML.

[6] Omar Besbes et al. Non-Stationary Stochastic Optimization, 2013, Oper. Res.

[7] Peter Auer et al. The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput.

[8] P. Auer et al. Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes, 2018.

[9] John Langford et al. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits, 2014, ICML.

[10] Elad Hazan et al. The computational power of optimization in online learning, 2015, STOC.

[11] Eli Upfal et al. Adapting to a Changing Environment: the Brownian Restless Bandits, 2008, COLT.

[12] Zohar S. Karnin et al. Multi-armed Bandits: Competing with Optimal Sequences, 2016, NIPS.

[13] David Simchi-Levi et al. Learning to Optimize under Non-Stationarity, 2018, AISTATS.

[14] Seshadhri Comandur et al. Efficient learning algorithms for changing environments, 2009, ICML.

[15] Jinfeng Yi et al. Tracking Slowly Moving Clairvoyant: Optimal Dynamic Regret of Online Learning with True and Noisy Gradient, 2016, ICML.

[16] Karthik Sridharan et al. BISTRO: An Efficient Relaxation-Based Method for Contextual Bandits, 2016, ICML.

[17] Omar Besbes et al. Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-Stationary Rewards, 2014, Stochastic Systems.

[18] Jinfeng Yi et al. Improved Dynamic Regret for Non-degenerate Functions, 2016, NIPS.

[19] Shahin Shahrampour et al. Online Optimization: Competing with Dynamic Comparators, 2015, AISTATS.

[20] Rebecca Willett et al. Online Learning for Changing Environments using Coin Betting, 2017, arXiv.

[21] John Langford et al. Efficient Optimal Learning for Contextual Bandits, 2011, UAI.

[22] John Langford et al. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information, 2007, NIPS.

[23] Wouter M. Koolen et al. Putting Bayes to sleep, 2012, NIPS.

[24] Mark Herbster et al. Tracking the Best Expert, 1998, Machine Learning.

[25] Fang Liu et al. A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem, 2017, AAAI.

[26] Haipeng Luo et al. Efficient Contextual Bandits in Non-stationary Worlds, 2018, COLT.

[27] John Langford et al. Contextual Bandit Algorithms with Supervised Learning Guarantees, 2010, AISTATS.

[28] Chen-Yu Wei et al. Tracking the Best Expert in Non-stationary Stochastic Environments, 2017, NIPS.

[29] Eric Moulines et al. On Upper-Confidence Bound Policies for Switching Bandit Problems, 2011, ALT.

[30] Manfred K. Warmuth et al. Tracking a Small Set of Experts by Mixing Past Posteriors, 2003, J. Mach. Learn. Res.