No-Regret Learning in Unknown Games with Correlated Payoffs

We consider the problem of learning to play a repeated multi-agent game with an unknown reward function. Single-player online learning algorithms attain strong regret bounds when provided with full-information feedback, which unfortunately is unavailable in many real-world scenarios. Bandit feedback alone, i.e., observing outcomes only for the selected action, yields substantially worse performance. In this paper, we consider a natural model where, besides a noisy measurement of the obtained reward, the player can also observe the opponents' actions. This feedback model, together with a regularity assumption on the reward function, allows us to exploit the correlations among different game outcomes by means of Gaussian processes (GPs). We propose GP-MW, a novel confidence-bound-based bandit algorithm that maintains a GP model of the reward function and runs a multiplicative weights (MW) method. We obtain novel kernel-dependent regret bounds that are comparable to the known bounds in the full-information setting, while substantially improving upon existing bandit results. We experimentally demonstrate the effectiveness of GP-MW on random matrix games, as well as on real-world problems of traffic routing and movie recommendation. In our experiments, GP-MW consistently outperforms several baselines, while its performance is often comparable to methods that have access to full-information feedback.
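To make the algorithm described above concrete, the following is a minimal Python sketch of a single GP-MW player: after each round it updates a GP posterior with the observed outcome (own action, opponents' action, noisy reward), forms optimistic upper-confidence-bound reward estimates for every own action against the observed opponent action, and feeds the corresponding losses to a multiplicative weights update. The scalar action encoding, the scikit-learn RBF kernel, the learning-rate schedule, and the confidence width `beta` are illustrative assumptions, not the paper's exact choices.

```python
# Hedged sketch of a GP-MW player; rewards are assumed to lie in [0, 1].
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


class GPMWPlayer:
    def __init__(self, n_actions, horizon, noise_var=0.01, beta=2.0):
        self.n_actions = n_actions
        self.eta = np.sqrt(8.0 * np.log(n_actions) / horizon)   # assumed MW learning rate
        self.beta = beta                                         # assumed UCB confidence width
        self.weights = np.ones(n_actions)                        # multiplicative weights
        self.gp = GaussianProcessRegressor(kernel=RBF(), alpha=noise_var)
        self.X, self.y = [], []                                  # observed (a_i, a_-i) -> reward

    def sample_action(self, rng):
        # Play a random action drawn from the normalized MW distribution.
        p = self.weights / self.weights.sum()
        return int(rng.choice(self.n_actions, p=p))

    def update(self, my_action, opponent_action, reward):
        # 1) Update the GP posterior with the newly observed game outcome.
        self.X.append([my_action, opponent_action])
        self.y.append(reward)
        self.gp.fit(np.array(self.X), np.array(self.y))

        # 2) Optimistic (upper-confidence-bound) reward estimate for every own
        #    action against the *observed* opponents' action.
        cand = np.array([[a, opponent_action] for a in range(self.n_actions)])
        mu, sigma = self.gp.predict(cand, return_std=True)
        ucb = np.clip(mu + self.beta * sigma, 0.0, 1.0)

        # 3) Multiplicative weights update on the optimistic losses 1 - ucb.
        self.weights *= np.exp(-self.eta * (1.0 - ucb))


# Toy usage: a repeated 5x5 matrix game with an unknown, smooth reward function.
rng = np.random.default_rng(0)
reward_fn = lambda a, b: 0.5 + 0.5 * np.sin(a - 0.7 * b)        # unknown to the player
player = GPMWPlayer(n_actions=5, horizon=50)
for t in range(50):
    a = player.sample_action(rng)
    b = int(rng.integers(5))                                     # opponents' action, observed
    r = reward_fn(a, b) + 0.05 * rng.normal()                    # noisy reward measurement
    player.update(a, b, r)
```

In this sketch the GP exploits correlations across game outcomes (via the kernel), so unplayed actions still receive informative optimistic estimates, which is what lets the MW update behave closer to the full-information setting than a plain bandit method would.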
