Graphical Models for Bandit Problems

We introduce a rich class of graphical models for multi-armed bandit problems that permit both the state or context space and the action space to be very large, yet succinctly specify the payoffs for any context-action pair. Our main result is an algorithm for such models whose regret is bounded by the number of parameters and whose running time depends only on the treewidth of the graph substructure induced by the action space.
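
As a concrete (and deliberately simplified) illustration of the two claims in the abstract, the sketch below assumes a chain-structured pairwise payoff model over binary action variables: the expected payoff of a joint action in a context is a sum of edge potentials, so the number of parameters is linear in the number of action variables even though the joint action space is exponential, and because a chain has treewidth 1 the payoff-maximizing action can be found by dynamic programming. The factor form, parameter names, and the max-sum recursion are illustrative assumptions for exposition, not the paper's exact model or algorithm.

# Illustrative sketch (not the paper's construction): chain-structured pairwise
# payoffs over K binary action variables.  The expected payoff of a joint action
# a = (a_0, ..., a_{K-1}) in context x is a sum of edge potentials
# theta[i][(x, a_i, a_{i+1})], so there are O(K * |X|) parameters even though
# there are 2^K joint actions.  Treewidth 1 lets us maximize over actions exactly
# by a max-sum dynamic program in O(K) time per context.

import random

K = 6                      # number of binary action variables (hypothetical)
CONTEXTS = range(3)        # small finite context set (hypothetical)

# theta[i][(x, a, b)] = contribution of edge (i, i+1) under context x
theta = [{(x, a, b): random.uniform(-1, 1)
          for x in CONTEXTS for a in (0, 1) for b in (0, 1)}
         for _ in range(K - 1)]

def payoff(x, action):
    """Expected payoff of a joint action in context x (sum of edge potentials)."""
    return sum(theta[i][(x, action[i], action[i + 1])] for i in range(K - 1))

def best_action(x):
    """Exact argmax over all 2^K joint actions via max-sum along the chain."""
    # value[a] = best achievable payoff of the suffix when the current variable is a
    value = {0: 0.0, 1: 0.0}
    choice = []                      # back-pointers, one table per edge
    for i in reversed(range(K - 1)):
        new_value, new_choice = {}, {}
        for a in (0, 1):
            cands = {b: theta[i][(x, a, b)] + value[b] for b in (0, 1)}
            b_star = max(cands, key=cands.get)
            new_value[a], new_choice[a] = cands[b_star], b_star
        value, choice = new_value, [new_choice] + choice
    a0 = max(value, key=value.get)   # best setting of the first variable
    action = [a0]
    for ptr in choice:               # follow back-pointers to recover the rest
        action.append(ptr[action[-1]])
    return tuple(action)

x = 1
a_star = best_action(x)
print(a_star, payoff(x, a_star))

Replacing the chain with a graph of bounded treewidth would replace the scan above with message passing on a junction tree; the per-context cost then grows exponentially only in the treewidth, which is the tractability property the abstract refers to.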
