Residual Squeeze-and-Excitation Network with Multi-scale Spatial Pyramid Module for Fast Robotic Grasping Detection

This paper proposes an efficient, fully convolutional neural network that generates robotic grasps from 300×300 depth images. Specifically, a residual squeeze-and-excitation network (RSEN) is introduced for deep feature extraction. Following the RSEN block, a multi-scale spatial pyramid module (MSSPM) is developed to capture multi-scale contextual information. The outputs of each RSEN block and of the MSSPM are combined as inputs for hierarchical feature fusion. The fused global features are then upsampled to perform pixel-wise learning for grasp pose estimation. Experimental results on the Cornell and Jacquard grasping datasets show that the proposed method has a fast inference time of 5 ms while achieving high grasp detection accuracies of 96.4% and 94.8% on Cornell and Jacquard, respectively, striking a balance between accuracy and running speed. The method also achieves a 90% physical grasp success rate with a UR5 robot arm.
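
To make the residual squeeze-and-excitation idea concrete, the following is a minimal pure-Python sketch of the generic mechanism: global average pooling per channel (squeeze), a two-layer bottleneck with ReLU and sigmoid (excitation), channel-wise rescaling, and an identity shortcut. It is an illustration under stated assumptions, not the RSEN described in this paper: the function names, the weight matrices `w1` and `w2`, and the `branch` argument standing in for the block's convolutional layers are all hypothetical, and the actual layer configuration may differ.

```python
import math


def global_avg_pool(x):
    """Squeeze: reduce each channel's H x W map to a single scalar."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def excitation(z, w1, w2):
    """Excitation: bottleneck layer with ReLU, then sigmoid gates per channel."""
    hidden = [max(0.0, h) for h in matvec(w1, z)]
    return [1.0 / (1.0 + math.exp(-a)) for a in matvec(w2, hidden)]


def residual_se(x, w1, w2, branch=lambda t: t):
    """Return x + s * F(x), where s are the channel gates computed from F(x).

    x      : feature map as nested lists with shape [C][H][W]
    w1, w2 : hypothetical bottleneck weights, shapes [C/r][C] and [C][C/r]
    branch : placeholder for the convolutional branch F (identity by default)
    """
    f = branch(x)
    s = excitation(global_avg_pool(f), w1, w2)
    return [
        [[xv + gate * fv for xv, fv in zip(x_row, f_row)]
         for x_row, f_row in zip(x_ch, f_ch)]
        for x_ch, f_ch, gate in zip(x, f, s)
    ]
```

In this sketch the gates are computed from the branch output and applied before the shortcut addition, which is the common SE-ResNet arrangement; whether the paper places the gating in the same position is an assumption.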