Revisiting State Augmentation methods for Reinforcement Learning with Stochastic Delays

Several real-world scenarios, such as remote control and sensing, involve action and observation delays. The presence of delays degrades the performance of reinforcement learning (RL) algorithms, often to the extent that they fail to learn anything substantial. This paper formally describes the notion of Markov Decision Processes (MDPs) with stochastic delays and shows that delayed MDPs can be transformed into equivalent standard MDPs (without delays) with a significantly simplified cost structure. We employ this equivalence to derive a model-free Delay-Resolved RL framework and show that even a simple RL algorithm built on this framework achieves near-optimal rewards in environments with stochastic delays in actions and observations. The delay-resolved deep Q-network (DRDQN) algorithm is benchmarked on a variety of environments with multi-step and stochastic delays, and it outperforms currently established algorithms both in achieving near-optimal rewards and in minimizing the computational overhead of doing so.
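
The state-augmentation idea is easy to illustrate. Below is a minimal Python sketch (not the paper's implementation) of an environment wrapper that realizes it: the agent's state is extended with the queue of actions that have been chosen but not yet executed, so the delayed problem becomes a standard MDP over the augmented state. The class name `DelayedEnvWrapper`, the `noop` and `release_prob` parameters, and the geometric release model are illustrative assumptions; the wrapper only presumes a gym-like `reset()`/`step()` interface.

```python
import random
from collections import deque


class DelayedEnvWrapper:
    """Sketch of state augmentation for stochastic action delays.

    The augmented state pairs the latest observation with the queue of
    pending actions, padded with no-ops to a fixed length so that the
    augmented state space stays finite.
    """

    def __init__(self, env, max_delay, noop, release_prob=0.7):
        self.env = env                    # any gym-like env with reset()/step()
        self.max_delay = max_delay        # hard upper bound on the delay
        self.noop = noop                  # placeholder action while waiting
        self.release_prob = release_prob  # chance the oldest action lands now
        self.pending = deque()

    def _augment(self, obs):
        # Pad the pending-action queue to the fixed length max_delay.
        pad = [self.noop] * (self.max_delay - len(self.pending))
        return (obs, tuple(self.pending) + tuple(pad))

    def reset(self):
        self.pending.clear()
        return self._augment(self.env.reset())

    def step(self, action):
        self.pending.append(action)
        # Geometric delay model (an assumption, not the paper's exact
        # construction): the oldest queued action is delivered this step
        # with probability release_prob; a full queue forces delivery, so
        # no action is delayed by more than max_delay steps.
        if len(self.pending) == self.max_delay or random.random() < self.release_prob:
            executed = self.pending.popleft()
        else:
            # Executing a no-op while waiting is one simple choice; a real
            # system might instead repeat the last executed action.
            executed = self.noop
        obs, reward, done, info = self.env.step(executed)
        return self._augment(obs), reward, done, info
```

Because the wrapper exposes an ordinary (augmented) state, a standard off-the-shelf algorithm such as DQN can be trained on it unchanged, which is the sense in which the delayed MDP reduces to a standard one.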
