Coarse-to-Fine Q-attention: Efficient Learning for Visual Robotic Manipulation via Discretisation

Reflecting on the last few years, the biggest breakthroughs in deep reinforcement learning (RL) have come in the discrete action domain. Robotic manipulation, however, is inherently a continuous control problem, and continuous-control RL algorithms typically rely on actor-critic methods that are sample-inefficient and difficult to train due to the joint optimisation of actor and critic. Motivated by this, we explore how to bring the stability of discrete-action RL algorithms to the robot manipulation domain. We extend the recently released ARM algorithm by replacing the continuous next-best pose agent with a discrete one. Discretising rotation is trivial given its bounded nature, whereas translation is inherently unbounded, making discretisation difficult. We formulate translation prediction as a voxel prediction problem by discretising 3D space; however, voxelising a large workspace is memory intensive and cannot support the high voxel density crucial to obtaining the resolution needed for robotic manipulation. We therefore propose to apply this voxel prediction in a coarse-to-fine manner, gradually increasing the resolution: at each step, we extract the highest-valued voxel as the predicted location, which then becomes the centre of the higher-resolution voxelisation in the next step. Applied over several steps, this coarse-to-fine prediction gives a near-lossless estimate of the translation. We show that our new coarse-to-fine algorithm accomplishes RLBench tasks much more efficiently than the continuous-control equivalent, and can even train some real-world tasks, tabula rasa, in less than 7 minutes with only 3 demonstrations. Moreover, we show that by moving to a voxel representation, we are able to easily incorporate observations from multiple cameras. Videos and code found at: https://sites.google.com/view/c2f-q-attention
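To make the coarse-to-fine procedure concrete, here is a minimal sketch of the refinement loop in Python/NumPy. The names `q_network`, `VOXEL_GRID_SIZE`, and `NUM_DEPTHS` are illustrative assumptions, not the paper's actual implementation; the sketch only shows the zoom-in logic described above.

```python
import numpy as np

VOXEL_GRID_SIZE = 16   # voxels per side at every depth (assumed value)
NUM_DEPTHS = 3         # number of coarse-to-fine refinement steps (assumed value)

def predict_translation(q_network, point_cloud, workspace_min, workspace_max):
    """Refine a predicted 3D location over several voxelisation steps.

    At each depth, the current region is voxelised at a fixed grid
    resolution, the Q-network scores every voxel, and the highest-valued
    voxel becomes the centre of a smaller, higher-resolution region at
    the next depth.
    """
    lo = np.asarray(workspace_min, dtype=np.float64)
    hi = np.asarray(workspace_max, dtype=np.float64)
    for _ in range(NUM_DEPTHS):
        # Hypothetical call: returns per-voxel Q-values of shape (N, N, N)
        # for the region [lo, hi] voxelised at N voxels per side.
        q_values = q_network(point_cloud, lo, hi, VOXEL_GRID_SIZE)
        idx = np.unravel_index(np.argmax(q_values), q_values.shape)
        # Centre of the highest-valued voxel in world coordinates.
        voxel_size = (hi - lo) / VOXEL_GRID_SIZE
        centre = lo + (np.asarray(idx) + 0.5) * voxel_size
        # Zoom in: the next region spans only this voxel's neighbourhood,
        # so the same N^3 grid now covers a much smaller volume.
        lo, hi = centre - voxel_size, centre + voxel_size
    return centre  # near-lossless after NUM_DEPTHS refinements
```

Under these assumed values, each step shrinks the region by a factor of 8 per side, so three depths give an effective resolution of roughly 1024 voxels per side over the original workspace while only ever holding 16^3 Q-values in memory at once.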
