Nonstochastic Multi-Armed Bandits with Graph-Structured Feedback

We present and study a partial-information model of online learning in which a decision maker repeatedly chooses from a finite set of actions and observes some subset of the associated losses. This naturally models situations where the losses of different actions are related, so that knowing the loss of one action provides information on the losses of others. Moreover, the model generalizes and interpolates between the well-studied full-information setting (where all losses are revealed) and the bandit setting (where only the loss of the chosen action is revealed). We give several algorithms addressing different variants of our setting, and prove tight regret bounds that depend on combinatorial properties of the feedback structure.
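To make the feedback model concrete, below is a minimal Python sketch of an Exp3-style importance-weighted scheme adapted to graph feedback, in the spirit of the Exp3-SET algorithm studied in this line of work. This is an illustrative sketch, not the paper's own pseudocode: the function name exp3_set and its parameters (losses, graph, eta) are assumptions made for the example. The key point is that the loss of arm j is estimated by dividing by the probability P_j that some arm revealing j is played, which keeps the estimate unbiased under any feedback graph.

import numpy as np

def exp3_set(losses, graph, eta, rng=None):
    """Sketch of an Exp3-SET-style learner with graph feedback.

    losses: (T, K) array of losses in [0, 1] chosen by the adversary.
    graph:  (K, K) boolean matrix; graph[i, j] = True means playing
            arm i reveals the loss of arm j (self-loops assumed, so
            every arm at least observes its own loss).
    eta:    learning rate.
    Returns the sequence of played arms.
    """
    rng = rng or np.random.default_rng(0)
    T, K = losses.shape
    w = np.ones(K)
    played = []
    for t in range(T):
        p = w / w.sum()                      # sampling distribution
        arm = rng.choice(K, p=p)
        played.append(arm)
        observed = graph[arm]                # losses revealed this round
        # P[j] = probability that arm j's loss is observed, i.e. the
        # total probability of arms whose out-neighborhood contains j
        P = graph.T @ p
        # importance-weighted loss estimates (zero for unobserved arms);
        # self-loops guarantee P[j] >= p[j] > 0, so the division is safe
        loss_hat = np.where(observed, losses[t] / P, 0.0)
        w *= np.exp(-eta * loss_hat)
        w /= w.max()                         # guard against underflow
    return played

# identity feedback graph recovers the bandit setting:
#   exp3_set(losses, np.eye(K, dtype=bool), eta=0.1)

With the complete graph every loss is observed each round and the scheme reduces to exponential weights for the full-information setting, while with the identity graph only the played arm is observed and it reduces to an Exp3-style bandit algorithm, matching the interpolation described in the abstract.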
