An empirical investigation of the challenges of real-world reinforcement learning

Reinforcement learning (RL) has proven its worth in a series of artificial domains, and is beginning to show some successes in real-world scenarios. However, many of the research advances in RL are hard to leverage in real-world systems because they rest on assumptions that are rarely satisfied in practice. In this work, we identify and formalize a series of independent challenges that embody the difficulties that must be addressed for RL to be commonly deployed in real-world systems. For each challenge, we define it formally in the context of a Markov Decision Process, analyze its effects on state-of-the-art learning algorithms, and present some existing attempts at tackling it. We believe that an approach that addresses our set of proposed challenges would be readily deployable in a large number of real-world problems. Our proposed challenges are implemented in a suite of continuous control environments called realworldrl-suite, which we propose as an open-source benchmark.
