Small-loss bounds for online learning with partial information

We consider the problem of adversarial (non-stochastic) online learning with partial information feedback, where at each round a decision maker selects an action from a finite set of alternatives. We develop a black-box approach for such problems, where the learner observes as feedback only the losses of a subset of the actions that includes the selected action. When the losses of actions are non-negative, under the graph-based feedback model introduced by Mannor and Shamir, we offer algorithms that attain the so-called "small-loss" $o(\alpha L^{\star})$ regret bounds with high probability, where $\alpha$ is the independence number of the feedback graph and $L^{\star}$ is the loss of the best action. Prior to our work, there was no data-dependent guarantee for general feedback graphs, even for pseudo-regret, that avoided dependence on the number of actions (i.e., that utilized the increased information feedback). Taking advantage of the black-box nature of our technique, we extend our results to many other applications, such as semi-bandits (including routing in networks), contextual bandits (even with an infinite comparator class), and learning with slowly changing (shifting) comparators. In the special case of classical bandit and semi-bandit problems, we provide optimal small-loss, high-probability guarantees of $\tilde{O}(\sqrt{dL^{\star}})$ for actual regret, where $d$ is the number of actions, answering open questions of Neu. Previous bounds for bandits and semi-bandits were known only for pseudo-regret and only in expectation. We also offer an optimal $\tilde{O}(\sqrt{\kappa L^{\star}})$ regret guarantee for fixed feedback graphs with clique-partition number at most $\kappa$.
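To make the quantities in these bounds concrete, here is the standard formulation of the setting (the horizon $T$ and per-round losses $\ell_t$ are introduced here only for illustration, assuming, as is standard, losses $\ell_t(a) \in [0,1]$). The benchmark is the cumulative loss of the best fixed action, and regret is measured against it:

$$L^{\star} = \min_{a} \sum_{t=1}^{T} \ell_t(a), \qquad \mathrm{Regret}_T = \sum_{t=1}^{T} \ell_t(a_t) - L^{\star},$$

where $a_t$ is the action selected at round $t$. Since $L^{\star} \le T$, a small-loss bound such as $\tilde{O}(\sqrt{\alpha L^{\star}})$ is never worse than the worst-case $\tilde{O}(\sqrt{\alpha T})$ guarantee, and it is substantially sharper whenever the best action incurs little cumulative loss ($L^{\star} \ll T$).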

[1] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules, 1985.

[2] James Hannan. Approximation to Bayes Risk in Repeated Play, 1957.

[3] Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. NIPS, 2015.

[4] Claudio Gentile et al. Adaptive and Self-Confident On-Line Learning Algorithms. J. Comput. Syst. Sci., 2000.

[5] Karthik Sridharan et al. Online Learning with Predictable Sequences. COLT, 2012.

[6] Shie Mannor et al. From Bandits to Experts: On the Value of Side-Observations. NIPS, 2011.

[7] Rémi Munos et al. Efficient learning by implicit exploration in bandit problems with side observations. NIPS, 2014.

[8] Baruch Awerbuch et al. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. STOC, 2004.

[9] Haipeng Luo et al. Achieving All with No Parameters: AdaNormalHedge. COLT, 2015.

[10] Gergely Neu. First-order regret bounds for combinatorial semi-bandits. COLT, 2015.

[11] Noga Alon et al. Online Learning with Feedback Graphs: Beyond Bandits. COLT, 2015.

[12] Jean-Yves Audibert et al. Regret Bounds and Minimax Policies under Partial Monitoring. J. Mach. Learn. Res., 2010.

[13] Karthik Sridharan et al. Online Non-Parametric Regression. COLT, 2014.

[14] Gergely Neu et al. An Efficient Algorithm for Learning with Semi-bandit Feedback. ALT, 2013.

[15] Gábor Lugosi et al. Prediction, Learning, and Games, 2006.

[16] Noga Alon et al. From Bandits to Experts: A Tale of Domination and Independence. NIPS, 2013.

[17] Gábor Lugosi et al. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 2004.

[18] John Langford et al. The Epoch-Greedy algorithm for contextual multi-armed bandits. NIPS, 2007.

[19] Avrim Blum et al. Routing without regret: on convergence to Nash equilibria of regret-minimizing algorithms in routing games. PODC, 2006.

[20] Yuanzhi Li et al. Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits. ICML, 2018.

[21] Mohammad Taghi Hajiaghayi et al. Regret minimization and the price of total anarchy. STOC, 2008.

[22] Ambuj Tewari et al. Online learning via sequential complexities. J. Mach. Learn. Res., 2010.

[23] Tim Roughgarden et al. Minimizing Regret with Multiple Reserves. EC, 2016.

[24] Seshadhri Comandur et al. Efficient learning algorithms for changing environments. ICML, 2009.

[25] Avrim Blum et al. Near-optimal online auctions. SODA, 2005.

[26] Karthik Sridharan et al. On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities. COLT, 2015.

[27] Amit Daniely et al. Strongly Adaptive Online Learning. ICML, 2015.

[28] Mark Herbster et al. Tracking the Best Expert. Machine Learning, 1995.

[29] John Langford et al. Contextual Bandit Algorithms with Supervised Learning Guarantees. AISTATS, 2010.

[30] Ambuj Tewari et al. Online Learning: Random Averages, Combinatorial Parameters, and Learnability. NIPS, 2010.

[31] Peter Auer et al. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput., 2002.

[32] Gábor Lugosi et al. Mathematics of Operations Research, 1998.

[33] Elad Hazan et al. Logarithmic regret algorithms for online convex optimization. Machine Learning, 2006.

[34] Sébastien Bubeck et al. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Found. Trends Mach. Learn., 2012.

[35] Karthik Sridharan et al. BISTRO: An Efficient Relaxation-Based Method for Contextual Bandits. ICML, 2016.

[36] Santosh S. Vempala et al. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 2005.

[37] Noga Alon et al. Nonstochastic Multi-Armed Bandits with Graph-Structured Feedback. SIAM J. Comput., 2014.

[38] Michal Valko et al. Online Learning with Noisy Side Observations. AISTATS, 2016.

[39] Manfred K. Warmuth et al. The Weighted Majority Algorithm. Inf. Comput., 1994.

[40] Christos Dimitrakakis et al. Thompson Sampling for Stochastic Bandits with Graph Feedback. AAAI, 2017.

[41] Éva Tardos et al. Learning and Efficiency in Games with Dynamic Population. SODA, 2015.

[42] Peter Auer et al. Hannan Consistency in On-Line Learning in Case of Unbounded Losses Under Partial Monitoring. ALT, 2006.

[43] Claudio Gentile et al. Regret Minimization for Reserve Prices in Second-Price Auctions. IEEE Transactions on Information Theory, 2013.

[44] T. Cover. Universal Portfolios. Mathematical Finance, 1991.

[45] Akshay Krishnamurthy et al. Efficient Algorithms for Adversarial Contextual Learning. ICML, 2016.

[46] Yoav Freund et al. A decision-theoretic generalization of on-line learning and an application to boosting. EuroCOLT, 1995.

[47] John Langford et al. Open Problem: First-Order Regret Bounds for Contextual Bandits. COLT, 2017.

[48] Éva Tardos et al. Learning in Games: Robustness of Fast Convergence. NIPS, 2016.

[49] Gilles Stoltz. Incomplete information and internal regret in prediction of individual sequences, 2005.

[50] Tamir Hazan et al. Online Learning with Feedback Graphs Without the Graphs. ICML, 2016.