Improved Exploration through Latent Trajectory Optimization in Deep Deterministic Policy Gradient

Model-free reinforcement learning algorithms such as Deep Deterministic Policy Gradient (DDPG) often require additional exploration strategies, especially when the actor is deterministic. This work evaluates the use of model-based trajectory optimization methods for exploration in DDPG when trained on a latent image embedding. In addition, an extension of DDPG is derived that uses a value function as the critic and a learned deep dynamics model to compute the policy gradient. This approach leads to a symbiotic relationship between the deep reinforcement learning algorithm and the latent trajectory optimizer: the trajectory optimizer benefits from the critic learned by the RL algorithm, and the latter benefits from the enhanced exploration generated by the planner. The developed methods are evaluated on two continuous control tasks, one in simulation and one in the real world. In particular, a Baxter robot is trained to perform an insertion task while receiving only sparse rewards and images as observations from the environment.
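The abstract only summarizes how the value-function critic and the learned dynamics model combine to form a policy gradient. The following is a minimal PyTorch-style sketch of that core idea: the gradient of the actor is obtained by backpropagating a one-step reward plus the discounted value of the predicted next latent state through the learned latent dynamics model. All architectures, dimensions, and the reward model here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes; in the paper's setting z is a latent image embedding.
latent_dim, action_dim = 32, 4

dynamics = nn.Sequential(              # learned latent dynamics model f(z, a) -> z'
    nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
    nn.Linear(128, latent_dim))
value = nn.Sequential(                 # value-function critic V(z)
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, 1))
actor = nn.Sequential(                 # deterministic policy pi(z) -> a
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, action_dim), nn.Tanh())

def reward(z, a):
    # Placeholder reward model for the sketch; the paper's tasks use sparse rewards.
    return -(z ** 2).mean(dim=-1, keepdim=True)

def policy_gradient_step(z, optimizer, gamma=0.99):
    """One actor update: differentiate r(z, a) + gamma * V(f(z, a)) w.r.t. the actor.

    Only the actor's parameters are updated; the dynamics model and the critic
    are trained separately (e.g., from replayed transitions) and held fixed here.
    """
    a = actor(z)
    z_next = dynamics(torch.cat([z, a], dim=-1))
    loss = -(reward(z, a) + gamma * value(z_next)).mean()  # maximize the return estimate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
z_batch = torch.randn(64, latent_dim)  # stand-in for a batch of latent embeddings
policy_gradient_step(z_batch, opt)
```

The same two learned components support the symbiosis described above: the latent trajectory optimizer can plan action sequences through the dynamics model and score them with the critic, while the exploratory trajectories it produces feed back into training the actor and critic.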
