Learning to Collaborate in Markov Decision Processes

We consider a two-agent MDP framework in which the agents repeatedly solve a task in a collaborative setting. We study the problem of designing a learning algorithm for the first agent (A1) that facilitates successful collaboration even when the second agent (A2) is adapting its policy in an unknown way. The key challenge in our setting is that the first agent faces non-stationarity in rewards and transitions because of the adaptive behavior of the second agent. We design novel online learning algorithms for agent A1 whose regret decays as $O(T^{\max\{1-\frac{3}{7} \cdot \alpha, \frac{1}{4}\}})$ over $T$ learning episodes, provided that the magnitude of agent A2's policy changes between any two consecutive episodes is upper bounded by $O(T^{-\alpha})$. Here, the parameter $\alpha$ is assumed to be strictly greater than $0$, and we show that this assumption is necessary, assuming that the learning parity with noise problem is computationally hard. We further show that sub-linear regret of agent A1 implies near-optimality of the agents' joint return for MDPs that manifest the properties of a smooth game.
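
To make the stated bound concrete, the following minimal sketch (our own illustration, not part of the paper) evaluates the regret exponent $\max\{1-\frac{3}{7}\alpha, \frac{1}{4}\}$ for a few values of the drift parameter $\alpha$; the function name and sample values are assumptions chosen purely for illustration.

```python
# Illustrative only: evaluates the regret exponent from the abstract,
# max{1 - (3/7)*alpha, 1/4}, for a few drift rates alpha.
# alpha controls how fast A2 adapts: its per-episode policy change is O(T^{-alpha}).

def regret_exponent(alpha: float) -> float:
    """Exponent of T in the regret bound O(T^{max{1 - (3/7)*alpha, 1/4}})."""
    return max(1.0 - (3.0 / 7.0) * alpha, 0.25)

if __name__ == "__main__":
    for alpha in (0.1, 0.5, 1.0, 1.75, 3.0):
        # Larger alpha (slower drift of A2's policy) yields a smaller exponent,
        # until the bound saturates at the T^{1/4} floor.
        print(f"alpha = {alpha:4.2f} -> regret ~ O(T^{regret_exponent(alpha):.3f})")
```

For example, $\alpha = 1$ gives an exponent of $4/7 \approx 0.571$, while any $\alpha \ge 7/4$ hits the $1/4$ floor; as $\alpha \to 0$ the exponent approaches $1$, i.e., the regret guarantee becomes vacuous, consistent with the requirement that $\alpha$ be strictly positive.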
