Online Learning with Dependent Stochastic Feedback Graphs

A general framework for online learning with partial information is one where feedback graphs specify which losses can be observed by the learner. We study a challenging scenario where feedback graphs vary stochastically with time and, more importantly, where graphs and losses are dependent. This scenario arises in several real-world applications, which we describe, where the outcomes of actions are correlated. We devise a new algorithm for this setting that exploits the stochastic properties of the graphs and benefits from favorable regret guarantees. We present a detailed theoretical analysis of this algorithm and report the results of a series of experiments on real-world datasets, showing that our algorithm outperforms standard baselines for online learning with feedback graphs.
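To make the protocol concrete, below is a minimal simulation sketch, not the paper's algorithm: at each round a feedback graph and the losses are drawn jointly (here coupled through a hypothetical shared latent variable, our stand-in for the dependence the abstract describes), the learner plays an arm, incurs its loss, and observes the losses of that arm's out-neighbors. The learner shown is a generic Exp3-style exponential-weights method with importance-weighted loss estimates; the dependence model, parameter values, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, eta = 5, 10_000, 0.05   # arms, rounds, learning rate (illustrative)

weights = np.ones(K)
total_loss = 0.0

for t in range(T):
    # Shared latent state couples the graph and the losses (the "dependence").
    z = rng.uniform()
    losses = np.clip(rng.normal(loc=z, scale=0.1, size=K), 0.0, 1.0)
    # Edge (i, j) present with probability increasing in the same latent z;
    # self-loops ensure the chosen arm's own loss is always observed.
    G = rng.uniform(size=(K, K)) < 0.3 + 0.4 * z
    np.fill_diagonal(G, True)

    p = weights / weights.sum()
    arm = rng.choice(K, p=p)
    total_loss += losses[arm]

    observed = G[arm]                        # losses revealed this round
    # P[loss of j is observed] = sum_i p_i * G[i, j]; plugging in the realized
    # graph assumes the learner sees G_t after playing.
    obs_prob = G.T.astype(float) @ p
    est = np.where(observed, losses / np.maximum(obs_prob, 1e-12), 0.0)
    weights *= np.exp(-eta * est)            # exponential-weights update
    weights /= weights.max()                 # rescale for numerical stability

print(f"average per-round loss: {total_loss / T:.3f}")
```

When graphs and losses are dependent, the realized graph itself carries information about the losses, so the naive importance weighting above need not yield unbiased estimates; exploiting the stochastic properties of the graphs in that dependent regime is exactly what the paper's algorithm addresses beyond this sketch.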
