Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs

We develop a new approach to obtaining high-probability regret bounds for online learning with bandit feedback against an adaptive adversary. While existing approaches all require carefully constructing optimistic and biased loss estimators, our approach uses standard unbiased estimators and relies on a simple increasing learning rate schedule, together with logarithmically homogeneous self-concordant barriers and a strengthened Freedman's inequality. Besides its simplicity, our approach enjoys several advantages. First, the obtained high-probability regret bounds are data-dependent and can be much smaller than the worst-case bounds, which resolves an open problem posed by Neu (2015). Second, resolving another open problem of Bartlett et al. (2008) and Abernethy and Rakhlin (2009), our approach leads to the first general and efficient algorithm with a high-probability regret bound for adversarial linear bandits, while previous methods are either inefficient or only applicable to specific action sets. Finally, our approach can also be applied to learning adversarial Markov Decision Processes and provides the first algorithm with a high-probability small-loss bound for this problem.
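
To make the recipe concrete for the multi-armed bandit case, here is a minimal sketch combining the ingredients named in the abstract: standard unbiased importance-weighted loss estimators, an online mirror descent step with a per-arm log-barrier regularizer, and an increasing learning rate schedule that bumps an arm's learning rate whenever its probability falls below a running threshold. The constants (initial learning rates, thresholds, and the multiplicative bump kappa) are placeholders for illustration, not the tuned values from the paper, and the sketch omits the linear-bandit and MDP extensions as well as the high-probability analysis.

import numpy as np


def log_barrier_omd_step(x, loss_est, eta):
    # One online mirror descent step over the probability simplex with the
    # per-arm log-barrier regularizer psi(p) = sum_i (1/eta_i) * log(1/p_i).
    # The minimizer has the form p_i = 1 / (eta_i * (loss_est_i + lam) + 1/x_i)
    # for a scalar lam chosen (by binary search) so that the p_i sum to 1.
    def coords(lam):
        return 1.0 / (eta * (loss_est + lam) + 1.0 / x)

    lo = np.max(-loss_est - 1.0 / (eta * x)) + 1e-12  # keeps every denominator positive
    hi = 0.0  # for nonnegative loss estimates, coords(0) sums to at most 1
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if coords(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    p = coords(hi)
    return p / p.sum()  # tiny renormalization against numerical drift


def mab_log_barrier(loss_matrix, seed=0):
    # Illustrative adversarial multi-armed bandit loop: unbiased importance-weighted
    # loss estimators, log-barrier OMD, and a per-arm increasing learning rate
    # schedule. All constants below are placeholders, not the paper's tuned values.
    rng = np.random.default_rng(seed)
    T, K = loss_matrix.shape
    x = np.full(K, 1.0 / K)                  # current sampling distribution
    eta = np.full(K, 0.1 / np.sqrt(T))       # per-arm learning rates (placeholder scale)
    threshold = np.full(K, 2.0 * K)          # trigger: 1/x_i exceeding this raises eta_i
    kappa = np.exp(1.0 / np.log(max(T, 3)))  # gentle multiplicative bump for eta
    total_loss = 0.0
    for t in range(T):
        arm = rng.choice(K, p=x)
        loss = loss_matrix[t, arm]
        total_loss += loss
        loss_est = np.zeros(K)
        loss_est[arm] = loss / x[arm]        # unbiased estimator: E[loss_est_i] = loss_i
        x = log_barrier_omd_step(x, loss_est, eta)
        small = 1.0 / x > threshold          # arms whose probability dropped below the threshold
        threshold[small] = 2.0 / x[small]    # raise the threshold so it only fires on further drops
        eta[small] *= kappa                  # increasing learning rate schedule
    return total_loss


if __name__ == "__main__":
    T, K = 2000, 5
    rng = np.random.default_rng(1)
    losses = rng.uniform(size=(T, K))
    losses[:, 0] *= 0.3                      # arm 0 is better on average
    print("learner loss:", mab_log_barrier(losses))
    print("best fixed arm loss:", losses.sum(axis=0).min())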

[1] Haipeng Luo, et al. Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition, 2020, ICML.

[2] Csaba Szepesvári, et al. Bandit Algorithms, 2020.

[3] Haipeng Luo, et al. A Closer Look at Small-loss Bounds for Bandits with Graph Feedback, 2020, COLT.

[4] Yishay Mansour, et al. Online Convex Optimization in Adversarial Markov Decision Processes, 2019, ICML.

[5] Haipeng Luo, et al. Improved Path-length Regret Bounds for Bandits, 2019, COLT.

[6] Jacob D. Abernethy, et al. Online Learning via the Differential Privacy Lens, 2017, NeurIPS.

[7] Yishay Mansour, et al. Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function, 2019, NeurIPS.

[8] Haipeng Luo, et al. Efficient Online Portfolio with Logarithmic Regret, 2018, NeurIPS.

[9] Yuanzhi Li, et al. Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits, 2018, ICML.

[10] Haipeng Luo, et al. More Adaptive Algorithms for Adversarial Bandits, 2018, COLT.

[11] Yuanzhi Li, et al. Sparsity, variance and curvature in multi-armed bandits, 2017, ALT.

[12] Éva Tardos, et al. Small-loss bounds for online learning with partial information, 2017, COLT.

[13] Haipeng Luo, et al. Corralling a Band of Bandit Algorithms, 2016, COLT.

[14] Yin Tat Lee, et al. Kernel-based methods for bandit convex optimization, 2016, STOC.

[15] Sebastian Pokutta, et al. An efficient high-probability algorithm for Linear Bandits, 2016, arXiv.

[16] Éva Tardos, et al. Learning in Games: Robustness of Fast Convergence, 2016, NIPS.

[17] Elad Hazan, et al. Volumetric Spanners: An Efficient Exploration Basis for Learning, 2013, J. Mach. Learn. Res.

[18] Gergely Neu, et al. Explore no more: Improved high-probability regret bounds for non-stochastic bandits, 2015, NIPS.

[19] Gergely Neu, et al. First-order regret bounds for combinatorial semi-bandits, 2015, COLT.

[20] Gergely Neu, et al. Online learning in episodic Markovian decision processes by relative entropy policy search, 2013, NIPS.

[21] Karthik Sridharan, et al. Online Learning with Predictable Sequences, 2012, COLT.

[22] Elad Hazan, et al. Interior-Point Methods for Full-Information and Bandit Online Learning, 2012, IEEE Transactions on Information Theory.

[23] Sham M. Kakade, et al. Towards Minimax Policies for Online Linear Optimization with Bandit Feedback, 2012, COLT.

[24] Ambuj Tewari, et al. Improved Regret Guarantees for Online Smooth Convex Optimization with Bandit Feedback, 2011, AISTATS.

[25] Gábor Lugosi, et al. Minimax Policies for Combinatorial Prediction Games, 2011, COLT.

[26] Nicolò Cesa-Bianchi, et al. Combinatorial Bandits, 2012, COLT.

[27] Jean-Yves Audibert, et al. Minimax Policies for Adversarial and Stochastic Bandits, 2009, COLT.

[28] Jacob D. Abernethy, et al. Beating the adaptive bandit with high probability, 2009, Information Theory and Applications Workshop.

[29] Elad Hazan, et al. Better Algorithms for Benign Bandits, 2009, J. Mach. Learn. Res.

[30] Thomas P. Hayes, et al. High-Probability Regret Bounds for Bandit Online Linear Optimization, 2008, COLT.

[31] Elad Hazan, et al. Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization, 2008, COLT.

[32] Thomas P. Hayes, et al. The Price of Bandit Information for Online Optimization, 2007, NIPS.

[33] Tamás Linder, et al. The On-Line Shortest Path Problem Under Partial Monitoring, 2007, J. Mach. Learn. Res.

[34] Peter Auer, et al. Hannan Consistency in On-Line Learning in Case of Unbounded Losses Under Partial Monitoring, 2006, ALT.

[35] Baruch Awerbuch, et al. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches, 2004, STOC.

[36] Peter Auer, et al. The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput.

[37] Yurii Nesterov, et al. Interior-point polynomial algorithms in convex programming, 1994, SIAM Studies in Applied Mathematics.

[38] D. Freedman. On Tail Probabilities for Martingales, 1975.