6D Object Pose Estimation with Attention Networks

Rigid object 6D pose estimation is a crucial topic in robotic manipulation and grasping tasks. Learning-based approaches operating on images and/or point clouds have, in particular, attracted considerable research attention. Existing approaches either rely on complicated pipelines for processing the data sources or on time-consuming post-processing steps. Furthermore, their strategies for fusing image and depth/point-cloud features are simplistic and lack the potential to fully exploit the complementary advantages of the two modalities. In this paper, we present an attention network, the Channel-Spatial Attention Network, to predict the 6D object pose from an image and a point cloud. Our network consists of two existing backbones (PSPNet and PointNet) that process the image and point cloud separately, and an attention-fusing module that operates on the concatenated feature embeddings. The attention-fusing module effectively leverages the fused embedding, allowing accurate 6D pose prediction for known objects. We evaluate the proposed network on the LineMod dataset, and the results demonstrate that it achieves remarkable performance compared with other methods.
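To make the fusion idea concrete, the following is a minimal, dependency-free sketch of channel-spatial attention applied to a fused per-point embedding. The gating functions and shapes here are illustrative assumptions, not the paper's exact module design: it gates the concatenated image/point-cloud features once along the channel axis and once along the point (spatial) axis.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_spatial_attention(features):
    """Reweight a fused per-point embedding with channel and spatial gates.

    `features` is an N x C nested list: one C-dimensional fused descriptor
    (image and point-cloud channels concatenated) per sampled point.
    This is an illustrative sketch, not the paper's exact architecture.
    """
    n = len(features)
    c = len(features[0])
    # Channel attention: squeeze over points (mean per channel),
    # then gate each channel with a sigmoid.
    channel_gate = [sigmoid(sum(f[j] for f in features) / n) for j in range(c)]
    # Spatial attention: squeeze over channels (mean per point),
    # then gate each point with a sigmoid.
    point_gate = [sigmoid(sum(f) / c) for f in features]
    # Apply both gates to the fused embedding.
    return [[features[i][j] * channel_gate[j] * point_gate[i]
             for j in range(c)]
            for i in range(n)]
```

In a full network, a learned squeeze-and-excitation block (with trainable weights) would replace the fixed sigmoid gates; the reweighted embedding would then feed the pose-regression head.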