Learning to Interact With Learning Agents

AI and machine learning methods increasingly interact with and seek information from people, robots, and other learning agents. The learning dynamics of these agents therefore create fundamentally new challenges for existing methods. Motivated by the application of learning to offer personalized deals to users, we highlight these challenges by studying a variant of the framework of "online learning using expert advice with bandit feedback". To better reflect real-world applications, we model each expert as a learning agent in our setting. Bandit feedback creates an additional difficulty here: at time t, only the expert i selected by the central algorithm (the forecaster) receives feedback from the environment and gets to learn at that time. A natural question is whether the forecaster can be competitive with the best expert j∗ had that expert seen all the feedback, i.e., competitive with the policy of always selecting expert j∗. We prove the following hardness result: without any coordination between the forecaster and the experts, no forecaster can achieve no-regret guarantees. We then consider a practical assumption that allows the forecaster to guide the learning process of the experts by blocking some of the feedback they observe from the environment, i.e., preventing the selected expert i from learning at time t for some time steps. With this additional coordination power, we design our forecaster LIL, which achieves no-regret guarantees, and we provide regret bounds that depend on the learning dynamics of the best expert j∗.
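To make the comparator precise, one plausible formalization of the regret in this setting (our notation, not necessarily the paper's) is

    R_T = \max_j \mathbb{E}\Big[\sum_{t=1}^T r_t(\pi_j^{\text{full}})\Big] - \mathbb{E}\Big[\sum_{t=1}^T r_t(I_t)\Big],

where \pi_j^{\text{full}} denotes expert j's policy had it received feedback at every round, I_t is the expert selected by the forecaster at round t, and no-regret means R_T = o(T).

The interaction protocol itself can be sketched in a few lines of Python. This is a minimal illustration under assumed names (Expert, run_protocol, block_schedule, and the multiplicative-weights update are all hypothetical, with rewards assumed in [0, 1]); it is not the LIL algorithm, whose selection and blocking rules are the paper's contribution:

    import random

    class Expert:
        """A learning agent: it updates its internal state only on rounds
        where it actually receives feedback."""
        def __init__(self, n_actions):
            self.weights = [1.0] * n_actions

        def act(self):
            # Sample an action proportionally to the current weights.
            total = sum(self.weights)
            r = random.random() * total
            for action, w in enumerate(self.weights):
                r -= w
                if r <= 0:
                    return action
            return len(self.weights) - 1

        def update(self, action, reward, eta=0.1):
            # Multiplicative-weights-style update on the played action.
            self.weights[action] *= 1.0 + eta * reward

    def run_protocol(experts, env_reward, T, block_schedule):
        """Each round: the forecaster picks expert I_t, plays that expert's
        action, observes the bandit feedback, and either forwards the
        feedback to the expert or blocks it (the coordination power
        described above)."""
        total_reward = 0.0
        for t in range(T):
            i = random.randrange(len(experts))      # placeholder selection rule
            action = experts[i].act()
            reward = env_reward(t, action)          # bandit feedback for round t
            total_reward += reward
            if not block_schedule(t, i):
                experts[i].update(action, reward)   # only expert i learns,
                                                    # and only if not blocked
        return total_reward

For instance, run_protocol(experts, reward_fn, T=1000, block_schedule=lambda t, i: t % 2 == 0) blocks learning on even rounds; with block_schedule always returning False (no coordination), the hardness result says that no selection rule can guarantee vanishing regret against the best fully-fed expert.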
