Arms and Hands Segmentation for Egocentric Perspective Based on PSPNet and Deeplab

First person videos and games are the central paradigms of camera positioning when using Head Mounted Displays (HMDs). In these situations, the user’s hands and arms play a fundamental role in self-presence feeling and interface. While their visual image is trivial in Augmented Reality devices or when using depth cameras attached to the HMDs, their rendering is not trivial to be solved with regular HMD, such as those based on smartphone devices. This work proposes the usage of semantic image segmentation with Fully Convolutional Networks for detaching user’s hands and arms from a raw image, captured by regular cameras, positioned as a First Person visual schema. We first create a training dataset composed by 4041 images and a validation dataset composed of 322 images, both of them receive labels for an arm and no-arm pixels, focused on the egocentric view. Then, based on two important architectures related to semantic segmentation - PSPNet and Deeplab - we propose a specific calibration for the particular scenario composed of hands and arms, captured from an HMD perspective. Our results show that PSPNet has better detail segmentation while Deeplab achieves best inference time performance. Training with our egocentric dataset generates better arm segmentation than using images in different and more general perspectives.

[1]  Sanjay Chawla,et al.  On the Statistical Consistency of Algorithms for Binary Classification under Class Imbalance , 2013, ICML.

[2]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[3]  Sabri Boughorbel,et al.  Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric , 2017, PloS one.

[4]  Nancy Chinchor,et al.  MUC-4 evaluation metrics , 1992, MUC.

[5]  Jitendra Malik,et al.  Indoor Scene Understanding with RGB-D Images: Bottom-up Segmentation, Object Detection and Semantic Segmentation , 2015, International Journal of Computer Vision.

[6]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[7]  B. Matthews Comparison of the predicted and observed secondary structure of T4 phage lysozyme. , 1975, Biochimica et biophysica acta.

[8]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Mark Sandler,et al.  MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  Xiaogang Wang,et al.  Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Matthias Rauterberg,et al.  The Evolution of First Person Vision Methods: A Survey , 2014, IEEE Transactions on Circuits and Systems for Video Technology.

[12]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, International Journal of Computer Vision.

[13]  Sanja Fidler,et al.  3D Graph Neural Networks for RGBD Semantic Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[14]  James M. Rehg,et al.  Learning to recognize objects in egocentric activities , 2011, CVPR 2011.

[15]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[16]  José García Rodríguez,et al.  A survey on deep learning techniques for image and video semantic segmentation , 2018, Appl. Soft Comput..

[17]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[18]  Mohan M. Trivedi,et al.  Driver hand localization and grasp analysis: A vision-based real-time approach , 2016, 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC).

[19]  Zhihua Cai,et al.  Evaluation Measures of the Classification Performance of Imbalanced Data Sets , 2009 .

[20]  Stepán Obdrzálek,et al.  Real-Time Human Pose Detection and Tracking for Tele-Rehabilitation in Virtual Reality , 2012, MMVR.

[21]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[22]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[23]  Ramya Hebbalaguppe,et al.  Real Time Hand Segmentation on Frugal Headmounted Device for Gestural Interface , 2018, 2018 25th IEEE International Conference on Image Processing (ICIP).

[24]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Luis Herranz,et al.  Depth CNNs for RGB-D Scene Recognition: Learning from Scratch Better than Transferring from RGB-CNNs , 2017, AAAI.

[26]  François Chollet,et al.  Xception: Deep Learning with Depthwise Separable Convolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Zhuowen Tu,et al.  Auto-Context and Its Application to High-Level Vision Tasks and 3D Brain Image Segmentation , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Cheng Li,et al.  Pixel-Level Hand Detection in Ego-centric Videos , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Oluwasanmi Koyejo,et al.  Consistent Binary Classification with Generalized Performance Metrics , 2014, NIPS.

[31]  Abdesselam Bouzerdoum,et al.  Skin segmentation using color pixel classification: analysis and comparison , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Daniela Gorski Trevisan,et al.  FPVRGame: Deep Learning for Hand Pose Recognition in Real-Time Using Low-End HMD , 2019, ICEC-JCSG.

[33]  Pietro Perona,et al.  Pedestrian Detection: An Evaluation of the State of the Art , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Vladimir Kolmogorov,et al.  "GrabCut": interactive foreground extraction using iterated graph cuts , 2004, ACM Trans. Graph..

[36]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[37]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[38]  George Papandreou,et al.  Rethinking Atrous Convolution for Semantic Image Segmentation , 2017, ArXiv.

[39]  Mel Slater,et al.  A Framework for Immersive Virtual Environments (FIVE): Speculations on the Role of Presence in Virtual Environments , 1997, Presence: Teleoperators & Virtual Environments.