No-Regret Learning in Unknown Games with Correlated Payoffs

We consider the problem of learning to play a repeated multi-agent game with an unknown reward function. Single-player online learning algorithms attain strong regret bounds when provided with full-information feedback, which unfortunately is unavailable in many real-world scenarios. Bandit feedback alone, i.e., observing outcomes only for the selected action, yields substantially worse performance. In this paper, we consider a natural model where, besides a noisy measurement of the obtained reward, the player can also observe the opponents' actions. This feedback model, together with a regularity assumption on the reward function, allows us to exploit the correlations among different game outcomes by means of Gaussian processes (GPs). We propose GP-MW, a novel confidence-bound-based bandit algorithm that maintains a GP model of the reward function and runs a multiplicative weights (MW) method. We obtain novel kernel-dependent regret bounds that are comparable to the known bounds in the full-information setting, while substantially improving upon existing bandit results. We experimentally demonstrate the effectiveness of GP-MW on random matrix games, as well as on real-world problems of traffic routing and movie recommendation. In our experiments, GP-MW consistently outperforms several baselines, while its performance is often comparable to methods that have access to full-information feedback.
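To make the algorithm described above concrete, the following is a minimal Python sketch of a single GP-MW player: after each round it updates a GP posterior with the observed outcome (own action, opponents' action, noisy reward), forms optimistic upper-confidence-bound reward estimates for every own action against the observed opponent action, and feeds the corresponding losses to a multiplicative weights update. The scalar action encoding, the scikit-learn RBF kernel, the learning-rate schedule, and the confidence width `beta` are illustrative assumptions, not the paper's exact choices.

```python
# Hedged sketch of a GP-MW player; rewards are assumed to lie in [0, 1].
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


class GPMWPlayer:
    def __init__(self, n_actions, horizon, noise_var=0.01, beta=2.0):
        self.n_actions = n_actions
        self.eta = np.sqrt(8.0 * np.log(n_actions) / horizon)   # assumed MW learning rate
        self.beta = beta                                         # assumed UCB confidence width
        self.weights = np.ones(n_actions)                        # multiplicative weights
        self.gp = GaussianProcessRegressor(kernel=RBF(), alpha=noise_var)
        self.X, self.y = [], []                                  # observed (a_i, a_-i) -> reward

    def sample_action(self, rng):
        # Play a random action drawn from the normalized MW distribution.
        p = self.weights / self.weights.sum()
        return int(rng.choice(self.n_actions, p=p))

    def update(self, my_action, opponent_action, reward):
        # 1) Update the GP posterior with the newly observed game outcome.
        self.X.append([my_action, opponent_action])
        self.y.append(reward)
        self.gp.fit(np.array(self.X), np.array(self.y))

        # 2) Optimistic (upper-confidence-bound) reward estimate for every own
        #    action against the *observed* opponents' action.
        cand = np.array([[a, opponent_action] for a in range(self.n_actions)])
        mu, sigma = self.gp.predict(cand, return_std=True)
        ucb = np.clip(mu + self.beta * sigma, 0.0, 1.0)

        # 3) Multiplicative weights update on the optimistic losses 1 - ucb.
        self.weights *= np.exp(-self.eta * (1.0 - ucb))


# Toy usage: a repeated 5x5 matrix game with an unknown, smooth reward function.
rng = np.random.default_rng(0)
reward_fn = lambda a, b: 0.5 + 0.5 * np.sin(a - 0.7 * b)        # unknown to the player
player = GPMWPlayer(n_actions=5, horizon=50)
for t in range(50):
    a = player.sample_action(rng)
    b = int(rng.integers(5))                                     # opponents' action, observed
    r = reward_fn(a, b) + 0.05 * rng.normal()                    # noisy reward measurement
    player.update(a, b, r)
```

In this sketch the GP exploits correlations across game outcomes (via the kernel), so unplayed actions still receive informative optimistic estimates, which is what lets the MW update behave closer to the full-information setting than a plain bandit method would.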
