Policy Learning with Continuous Memory States for Partially Observed Robotic Control

Policy learning for partially observed control tasks requires policies that can store and recall information from past observations. In this paper, we present a method for learning policies with memory for high-dimensional, continuous systems. Our approach does not assume a known state representation and does not attempt to explicitly model the belief over the unobserved state. Instead, we directly learn a policy that can read from and write to an internal continuous-valued memory. Such a policy can be viewed as a recurrent neural network (RNN). However, by representing the memory as state variables, our approach avoids common problems that plague RNN training, such as vanishing and exploding gradients. The policy is optimized with a guided policy search algorithm that alternates between optimizing trajectories through the augmented state space (comprising both physical and memory states) and training the policy with supervised learning to match these trajectories. We evaluate our method on continuous control tasks in manipulation and navigation settings, and show that it can learn complex policies that successfully complete a range of tasks requiring memory.
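To make the construction concrete, below is a minimal Python sketch of the memory-as-state idea and the alternating optimization it enables. This is not the paper's implementation: the toy point-mass dynamics, the random-shooting stand-in for the trajectory optimizer, the linear least-squares policy fit, and all names and dimensions are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of memory as continuous state:
# the state is augmented with memory variables m, the action with memory
# "writes" w, and memory dynamics are simple integration, so a trajectory
# optimizer can shape memory use like any other state variable.
import numpy as np

DX, DU, DM = 2, 1, 1          # physical state, physical action, memory dims
DS, DA = DX + DM, DU + DM     # augmented state / action dims


def step(s, a):
    """Augmented dynamics: s = [x, m], a = [u, w].

    The physical state x follows toy point-mass dynamics; the memory m
    simply integrates the writes w, making it fully differentiable.
    """
    x, m = s[:DX], s[DX:]
    u, w = a[:DU], a[DU:]
    x_next = x + 0.1 * np.concatenate([x[1:], u])   # position, velocity
    m_next = m + w                                   # memory write
    return np.concatenate([x_next, m_next])


def optimize_trajectory(s0, horizon=20, n_samples=200):
    """Stand-in for trajectory optimization: random shooting that keeps
    the lowest-cost action sequence. The paper uses a proper trajectory
    optimizer; this placeholder only illustrates the interface."""
    best_cost, best = np.inf, None
    for _ in range(n_samples):
        actions = 0.1 * np.random.randn(horizon, DA)
        s, states = s0, []
        for a in actions:
            states.append(s)
            s = step(s, a)
        cost = np.sum(np.square(states)) + 1e-2 * np.sum(np.square(actions))
        if cost < best_cost:
            best_cost, best = cost, (np.array(states), actions)
    return best


def guided_policy_search(n_iters=5):
    """Alternate between optimizing trajectories through the augmented
    (physical + memory) state space and fitting the policy to them."""
    K = np.zeros((DA, DS))                # linear policy a = K s
    for _ in range(n_iters):
        s0 = np.concatenate([np.random.randn(DX), np.zeros(DM)])
        S, A = optimize_trajectory(s0)
        # Supervised step: least-squares fit of a = K s to the trajectory.
        K = np.linalg.lstsq(S, A, rcond=None)[0].T
    return K


if __name__ == "__main__":
    print("learned policy gains:\n", guided_policy_search())
```

The point the sketch illustrates is that memory writes are ordinary action dimensions and memory contents are ordinary state dimensions, so the optimizer shapes memory use directly instead of backpropagating through time. The actual algorithm also constrains the optimized trajectories to stay close to the current policy, which this sketch omits for brevity.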
