The Treachery of Images: Bayesian Scene Keypoints for Deep Policy Learning in Robotic Manipulation

In policy learning for robotic manipulation, sample efficiency is of paramount importance. Learning and extracting more compact representations from camera observations is therefore a promising avenue. However, current methods often assume full observability of the scene and struggle to achieve scale invariance. In many tasks and settings, the full-observability assumption does not hold: objects in the scene are often occluded or lie outside the camera's field of view, rendering the camera observation ambiguous with regard to their location. To tackle this problem, we present BASK, a Bayesian approach to tracking scale-invariant keypoints over time. Our approach successfully resolves inherent ambiguities in images, enabling keypoint tracking on symmetrical objects as well as on occluded and out-of-view objects. We employ our method to learn challenging multi-object robot manipulation tasks from wrist camera observations and demonstrate superior utility for policy learning compared to other representation learning techniques. Furthermore, we show outstanding robustness to disturbances such as clutter, occlusions, and noisy depth measurements, as well as generalization to unseen objects, in both simulation and real-world robotic experiments.
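To illustrate the general idea of resolving image ambiguities by tracking a belief over time, the following is a minimal sketch of a generic discrete Bayes filter on a one-dimensional toy "image". It is not the BASK implementation: the grid size, the likelihood values, the blur kernel, and the function names `bayes_update` and `predict` are all assumptions chosen for illustration. The sketch shows how a prior carried over from an earlier, unambiguous frame can disambiguate a later frame in which a symmetrical object produces two equally good matches, and how the belief is retained when an observation is uninformative (for example, under occlusion).

```python
import numpy as np


def bayes_update(prior, likelihood, eps=1e-12):
    """Fuse a prior belief with an observation likelihood and renormalize."""
    posterior = prior * likelihood
    total = posterior.sum()
    if total < eps:
        # Uninformative observation (e.g. keypoint occluded or out of view):
        # keep the prior rather than dividing by a near-zero mass.
        return prior.copy()
    return posterior / total


def predict(belief, kernel):
    """Motion step: blur the belief to account for unmodeled keypoint motion."""
    blurred = np.convolve(belief, kernel, mode="same")
    return blurred / blurred.sum()


# Hypothetical 1-D image with 11 pixel locations and a uniform initial belief.
n = 11
belief = np.full(n, 1.0 / n)

# Frame 1: a single, unambiguous match at pixel 2.
lik1 = np.full(n, 0.01)
lik1[2] = 1.0
belief = bayes_update(belief, lik1)

# Between frames: allow the keypoint to drift by about one pixel.
kernel = np.array([0.25, 0.5, 0.25])
belief = predict(belief, kernel)

# Frame 2: a symmetrical object yields two equally good matches (pixels 2 and 8).
lik2 = np.full(n, 0.01)
lik2[2] = 1.0
lik2[8] = 1.0
belief = bayes_update(belief, lik2)

# The prior from frame 1 favors pixel 2 over the symmetric alternative.
estimate = int(np.argmax(belief))

# Frame 3: keypoint fully occluded, so the likelihood is zero everywhere.
occluded = bayes_update(belief, np.zeros(n))
```

In this toy setting, the observation in frame 2 alone cannot distinguish the two candidate locations; only the belief propagated from frame 1 breaks the tie. A real system would operate on two- or three-dimensional keypoint hypotheses and a learned likelihood, which this sketch does not attempt to model.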
