Inapplicable Actions Learning for Knowledge Transfer in Reinforcement Learning

Reinforcement Learning (RL) algorithms are known to scale poorly to environments with many available actions, requiring numerous samples to learn an optimal policy. The traditional approach of considering the same fixed action space in every state implies that the agent must learn, while also maximizing its reward, to ignore irrelevant actions, in particular $\textit{inapplicable actions}$ (i.e., actions that have no effect on the environment when executed in a given state). Knowing which actions are inapplicable can reduce the sample complexity of RL algorithms: masking them out of the policy distribution restricts exploration to actions relevant to finding an optimal policy. While this notion has long been formalized in the Automated Planning community through the concept of precondition in the STRIPS language, RL algorithms have never formally exploited this information to prune the space of actions to explore; it is typically handled in an ad-hoc manner, with hand-crafted domain logic added to the RL algorithm. In this paper, we propose a more systematic approach to introducing this knowledge into the algorithm. We (i) standardize the way such knowledge can be manually specified to the agent; and (ii) present a new framework that autonomously learns the partial action model encapsulating the precondition of each action jointly with the policy. We show experimentally that learning inapplicable actions greatly improves the sample efficiency of the algorithm by providing a reliable signal for masking out irrelevant actions. Moreover, we demonstrate that the acquired knowledge is transferable: it can be reused in other tasks and domains to make learning more efficient.
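To make the masking mechanism concrete, here is a minimal PyTorch sketch of a categorical policy paired with a learned applicability head that plays the role of the partial action model. Everything here is an illustrative assumption rather than the paper's implementation: the class name `MaskedPolicy`, the two-layer architecture, and the `threshold` hyperparameter are all invented for the example.

```python
import torch
import torch.nn as nn


class MaskedPolicy(nn.Module):
    """Categorical policy with a learned action-applicability mask.

    The architecture, layer sizes, and `threshold` are illustrative
    assumptions, not details taken from the paper.
    """

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.pi = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )
        # Partial action model: per-action probability that the action
        # is applicable (i.e. would change the state) in `obs`.
        self.applicability = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions), nn.Sigmoid(),
        )

    def forward(self, obs: torch.Tensor, threshold: float = 0.5):
        logits = self.pi(obs)
        applicable = self.applicability(obs) > threshold
        # Masking: -inf logits receive zero probability after the
        # softmax, so predicted-inapplicable actions are never sampled.
        masked = logits.masked_fill(~applicable, float("-inf"))
        # Guard: if no action clears the threshold, fall back to the
        # unmasked logits to keep the distribution well defined.
        none_valid = ~applicable.any(dim=-1, keepdim=True)
        masked = torch.where(none_valid, logits, masked)
        return torch.distributions.Categorical(logits=masked)
```

The resulting `Categorical` distribution exposes `sample()` and `log_prob()`, so it drops into any standard policy-gradient loss. The applicability head can in principle be trained jointly with the policy from a self-supervised signal gathered during rollouts; one plausible labeling rule, assuming a fully observable state where inapplicability means the state is left unchanged, is sketched below (again a hypothetical training recipe, not the paper's exact procedure):

```python
import torch
import torch.nn.functional as F


def applicability_loss(model, obs, action, next_obs):
    """Binary cross-entropy on a self-supervised applicability label.

    Labeling rule (an assumption): if executing `action` left the
    observation unchanged, the action is labeled inapplicable (0),
    otherwise applicable (1).
    """
    pred = model.applicability(obs).gather(1, action.unsqueeze(1)).squeeze(1)
    label = (~torch.isclose(obs, next_obs).all(dim=-1)).float()
    return F.binary_cross_entropy(pred, label)
```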
