Learning to Execute: Efficient Learning of Universal Plan-Conditioned Policies in Robotics

Applications of Reinforcement Learning (RL) in robotics are often limited by high data demands. On the other hand, approximate models are readily available in many robotics scenarios, making model-based approaches such as planning a data-efficient alternative. Still, the performance of these methods suffers if the model is imprecise or wrong. In this sense, the respective strengths and weaknesses of RL and model-based planners are complementary. In the present work, we investigate how both approaches can be integrated into one framework that combines their strengths. We introduce Learning to Execute (L2E), which leverages the information contained in approximate plans to learn universal policies that are conditioned on plans. In our robotic manipulation experiments, L2E exhibits increased performance when compared to pure RL, pure planning, or baseline methods combining learning and planning.
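
The sketch below illustrates the core idea of a plan-conditioned policy: the action depends not only on the current state but also on an encoding of an approximate plan. This is a minimal illustration, not the authors' implementation; the network architecture, the flattening of the plan into a fixed-length vector, and all dimensions are assumptions made for the example.

```python
# Minimal sketch (illustrative, not the L2E implementation): a policy that is
# conditioned on both the current state and a fixed-length encoding of an
# approximate plan, in the spirit of universal plan-conditioned policies.
import torch
import torch.nn as nn


class PlanConditionedPolicy(nn.Module):
    def __init__(self, state_dim: int, plan_steps: int, plan_step_dim: int, action_dim: int):
        super().__init__()
        plan_dim = plan_steps * plan_step_dim  # assumption: flatten the plan into one vector
        self.net = nn.Sequential(
            nn.Linear(state_dim + plan_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
            nn.Tanh(),  # bounded, normalized actions
        )

    def forward(self, state: torch.Tensor, plan: torch.Tensor) -> torch.Tensor:
        # plan: (batch, plan_steps, plan_step_dim) -> flatten and condition the policy on it
        plan_flat = plan.flatten(start_dim=1)
        return self.net(torch.cat([state, plan_flat], dim=-1))


if __name__ == "__main__":
    policy = PlanConditionedPolicy(state_dim=10, plan_steps=20, plan_step_dim=3, action_dim=4)
    s = torch.randn(1, 10)     # current robot state
    p = torch.randn(1, 20, 3)  # approximate plan, e.g. a coarse waypoint sequence
    action = policy(s, p)
    print(action.shape)        # torch.Size([1, 4])
```

Because the plan is an input rather than a fixed target, the same policy can in principle be reused for any plan produced by an approximate planner, which is what makes it "universal" in the sense used above.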
