Learning Adversarial Markov Decision Processes with Delayed Feedback

Reinforcement learning typically assumes that agents observe feedback for their actions immediately, but in many real-world applications (such as recommendation systems) feedback arrives with delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed feedback. That is, the costs and trajectory of episode k are revealed to the learner only at the end of episode k+dᵏ, where the delays dᵏ are neither identical nor bounded and are chosen by an oblivious adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of (K+D)^(1/2) under full-information feedback, where K is the number of episodes and D=∑ₖ dᵏ is the total delay. Under bandit feedback, we prove a similar (K+D)^(1/2) regret bound assuming the costs are stochastic, and a (K+D)^(2/3) regret bound in the general case. We are the first to consider regret minimization in the important setting of MDPs with delayed feedback.
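
To make the delay model concrete, the following is a minimal sketch of the feedback bookkeeping only, not of the paper's algorithms. It assumes episodes are numbered 1..K and that the delay sequence is given as a plain list; the function names `feedback_schedule` and `total_delay` are hypothetical and chosen for illustration. Feedback whose arrival episode exceeds K is treated as never observed within the horizon.

```python
from collections import defaultdict


def feedback_schedule(delays):
    """Map each episode j to the episodes k whose feedback arrives at the end of j.

    delays[k-1] is the delay d^k of episode k (episodes are 1-indexed).
    Feedback for episode k becomes available at the end of episode k + d^k.
    """
    num_episodes = len(delays)
    arrivals = defaultdict(list)
    for k, d in enumerate(delays, start=1):
        arrival = k + d
        # Feedback that would arrive after the last episode is never seen.
        if arrival <= num_episodes:
            arrivals[arrival].append(k)
    return dict(arrivals)


def total_delay(delays):
    """Total delay D, the sum of d^k over all episodes."""
    return sum(delays)


# Illustrative delay sequence for K = 5 episodes (made-up values).
delays = [0, 2, 0, 1, 3]
schedule = feedback_schedule(delays)
D = total_delay(delays)
```

With these illustrative delays, episode 2's feedback is observed only at the end of episode 4, and episode 5's feedback falls outside the horizon, so a learner acting in episodes 3 and 4 has to choose its policy without knowing the costs of episode 2.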
