Learning to Plan via a Multi-Step Policy Regression Method

We propose a new approach to improve inference performance in environments that must be solved with a specific sequence of actions, such as maze environments where an optimal path is to be found. Instead of learning a policy for a single step, we learn a policy that predicts n actions in advance. Our proposed method, policy horizon regression (PHR), uses environment knowledge sampled by A2C to learn an n-dimensional policy vector in a policy distillation setup, yielding n sequential actions per observation. We test our method on the MiniGrid and Pong environments and show a drastic speedup at inference time by successfully predicting sequences of actions from a single observation.
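As a high-level illustration of the idea, the following is a minimal sketch of a multi-step policy head trained by distillation, assuming a PyTorch implementation. The network PHRNet, its layer sizes, and the per-step cross-entropy distillation loss are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PHRNet(nn.Module):
    """Hypothetical multi-step policy head: maps one observation to n
    action distributions, one per future step of the horizon."""

    def __init__(self, obs_dim: int, n_actions: int, horizon: int):
        super().__init__()
        self.horizon = horizon
        self.n_actions = n_actions
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        # One logit vector per step of the horizon.
        self.head = nn.Linear(128, horizon * n_actions)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Returns logits of shape (batch, horizon, n_actions).
        return self.head(self.body(obs)).view(-1, self.horizon, self.n_actions)


def distillation_loss(student_logits: torch.Tensor,
                      teacher_actions: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the student's per-step distributions and an
    action sequence sampled by the teacher (e.g. A2C) from the environment.
    student_logits: (batch, horizon, n_actions); teacher_actions: (batch, horizon)."""
    b, h, a = student_logits.shape
    return F.cross_entropy(student_logits.view(b * h, a),
                           teacher_actions.view(b * h))


# Usage sketch: a single observation yields a whole plan of n actions,
# so inference needs one forward pass instead of n.
if __name__ == "__main__":
    net = PHRNet(obs_dim=64, n_actions=4, horizon=5)
    obs = torch.randn(1, 64)           # one observation
    plan = net(obs).argmax(dim=-1)     # (1, 5) greedy action sequence
    print(plan)
```

Because the whole action sequence comes from one forward pass, the per-episode inference cost drops roughly by a factor of the horizon n, which is the source of the speedup claimed above.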
