Where is my hand? Deep hand segmentation for visual self-recognition in humanoid robots

The ability to distinguish between the self and the background is of paramount importance for robotic tasks. Hands in particular, as the end effectors that most often come into contact with other elements of the environment, must be perceived and tracked with precision so that the robot can execute the intended tasks dexterously and without colliding with obstacles. They are fundamental to several applications, from human-robot interaction to object manipulation. Modern humanoid robots have a high number of degrees of freedom, which makes their forward kinematics models very sensitive to uncertainty. Thus, vision sensing can be the only way to endow these robots with a good perception of the self, allowing them to localize their body parts with precision. In this paper, we propose the use of a Convolutional Neural Network (CNN) to segment the robot hand from an image in an egocentric view. CNNs are known to require a huge amount of training data. To overcome the challenge of labeling real-world images, we propose the use of simulated datasets exploiting domain randomization techniques. We fine-tuned the Mask R-CNN network for the specific task of segmenting the hand of the humanoid robot Vizzy. We focus on developing a methodology that requires small amounts of data to achieve reasonable performance, while giving detailed insight into how to properly generate variability in the training dataset. Moreover, we analyze the fine-tuning process within the complex Mask R-CNN model to understand which weights should be transferred to the new task of segmenting robot hands. Our final model was trained solely on synthetic images and achieves an average IoU of 82% on synthetic validation data and 56.3% on real test data. These results were achieved with only 1000 training images and 3 hours of training time using a single GPU.
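
The reported scores are average IoU (intersection over union) values between predicted and ground-truth hand masks. As a minimal sketch of how such a metric is commonly computed, the following assumes binary masks stored as nested lists of 0/1 values; the function names `mask_iou` and `mean_iou` and the empty-mask convention are illustrative assumptions, not the evaluation code used in the paper.

```python
from typing import Iterable, Sequence, Tuple

# A binary mask as rows of 0/1 values (assumed representation for this sketch).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union between two binary masks of equal shape."""
    if len(pred) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, truth)
    ):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            p_on, t_on = bool(p), bool(t)
            intersection += p_on and t_on  # pixel labeled hand in both masks
            union += p_on or t_on          # pixel labeled hand in either mask
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, ground truth) mask pairs."""
    scores = [mask_iou(pred, truth) for pred, truth in pairs]
    if not scores:
        raise ValueError("need at least one mask pair")
    return sum(scores) / len(scores)


# Toy 3x3 example: 3 shared pixels, 5 pixels in the union.
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
truth = [[0, 0, 1],
         [0, 1, 1],
         [0, 0, 1]]
print(mask_iou(pred, truth))  # 3 / 5 = 0.6
```

On this toy pair the masks share 3 pixels out of 5 in their union, so the IoU is 0.6; averaging such per-image scores over a dataset gives a figure comparable in kind to the 82% and 56.3% quoted above.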
