Efficient Planning in a Compact Latent Action Space

Planning-based reinforcement learning has shown strong performance on tasks with discrete and low-dimensional continuous action spaces. However, planning usually adds significant computational overhead to decision-making, and scaling such methods to high-dimensional action spaces remains challenging. To advance efficient planning for high-dimensional continuous control, we propose the Trajectory Autoencoding Planner (TAP), which learns low-dimensional latent action codes with a state-conditional VQ-VAE. The decoder of the VQ-VAE thus serves as a novel dynamics model that takes latent actions and the current state as input and reconstructs long-horizon trajectories. At inference time, given a starting state, TAP searches over discrete latent actions to find trajectories that have both high probability under the training distribution and high predicted cumulative reward. Empirical evaluation in the offline RL setting shows that TAP's decision latency stays low and is largely unaffected by the dimensionality of the raw action space. On Adroit robotic hand manipulation tasks with high-dimensional continuous action spaces, TAP surpasses existing model-based methods by a large margin and also beats strong model-free actor-critic baselines.
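
As a rough illustration of the mechanism the abstract describes, the sketch below wires up a state-conditional codebook model whose decoder maps discrete latent codes plus the current state to a trajectory and a predicted return, together with a simple random-shooting search over those codes. This is a minimal, hypothetical PyTorch sketch: the layer sizes, names, and the scoring rule (predicted return only) are assumptions for illustration, not the paper's implementation, which trains the full VQ-VAE and also weights candidate code sequences by their probability under the training distribution.

# Hypothetical sketch of a state-conditional latent-action model and planner.
# Shapes, module names, and the random-shooting search are illustrative assumptions.
import torch
import torch.nn as nn


class LatentTrajectoryModel(nn.Module):
    def __init__(self, state_dim, traj_dim, num_codes=64, code_dim=16, num_latents=4):
        super().__init__()
        self.num_codes, self.code_dim, self.num_latents = num_codes, code_dim, num_latents
        self.codebook = nn.Embedding(num_codes, code_dim)  # discrete latent "actions"
        self.encoder = nn.Linear(state_dim + traj_dim, num_latents * code_dim)
        # Decoder: (current state, latent codes) -> flattened trajectory + predicted return.
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + num_latents * code_dim, 256),
            nn.ReLU(),
            nn.Linear(256, traj_dim + 1),
        )

    def encode(self, state, traj):
        # Training-time path: embed (state, trajectory) and snap to the nearest
        # codebook entries (the straight-through VQ gradient is omitted for brevity).
        z = self.encoder(torch.cat([state, traj], dim=-1))
        flat = z.reshape(-1, self.code_dim)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        return idx.reshape(z.shape[0], self.num_latents)

    def decode(self, state, codes):
        # codes: (batch, num_latents) integer indices of latent actions.
        z = self.codebook(codes).reshape(codes.shape[0], -1)
        out = self.decoder(torch.cat([state, z], dim=-1))
        return out[:, :-1], out[:, -1]  # reconstructed trajectory, predicted return


def plan(model, state, num_candidates=256):
    # Random-shooting search over discrete latent codes: decode candidates from the
    # current state and keep the highest predicted return. TAP additionally scores
    # candidates by their likelihood under the training distribution, omitted here.
    with torch.no_grad():
        codes = torch.randint(model.num_codes, (num_candidates, model.num_latents))
        states = state.expand(num_candidates, -1)
        trajs, returns = model.decode(states, codes)
        best = returns.argmax()
    return trajs[best], codes[best]


if __name__ == "__main__":
    model = LatentTrajectoryModel(state_dim=39, traj_dim=120)  # sizes are placeholders
    state = torch.zeros(1, 39)
    codes = model.encode(state, torch.zeros(1, 120))  # training-style encoding
    traj, best_codes = plan(model, state)             # inference-time latent search
    print(codes.shape, traj.shape, best_codes)

Because the search operates on a fixed number of discrete latent codes rather than on raw action vectors, the cost of this planning step does not grow with the dimensionality of the underlying action space, which is the efficiency property the abstract highlights.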
