A Method for Detecting Interaction between 3D Hands and Unknown Objects in RGB Video

We propose a model that extracts the 3D poses of the hand and the object from each frame of an RGB video using a single feed-forward neural network together with a zero-shot learning classifier, and that recognizes interactions with unknown objects across the entire video through an interactive temporal module. The model is trained end-to-end and requires neither depth images nor annotated coordinates as input, which makes it well suited to real-world applications.
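As a rough illustration of this pipeline, the sketch below shows how such a system might be wired together in PyTorch: a per-frame feed-forward network regresses 3D hand and object keypoints plus an object embedding, a zero-shot classifier matches that embedding against semantic class prototypes (e.g., word vectors), and a temporal module aggregates the per-frame states into an interaction prediction. All module names, dimensions, and architectural choices here are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerFramePoseNet(nn.Module):
    """Feed-forward network: one RGB frame -> 3D hand joints, 3D object keypoints,
    and an object embedding for zero-shot matching (all shapes are assumptions)."""
    def __init__(self, n_hand_joints=21, n_obj_points=21, feat_dim=256, embed_dim=300):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in for a real CNN backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.hand_head = nn.Linear(feat_dim, n_hand_joints * 3)   # 3D hand joints
        self.obj_head = nn.Linear(feat_dim, n_obj_points * 3)     # 3D object keypoints
        self.obj_embed = nn.Linear(feat_dim, embed_dim)           # semantic embedding

    def forward(self, frame):
        f = self.backbone(frame)
        hand = self.hand_head(f).view(-1, 21, 3)
        obj = self.obj_head(f).view(-1, 21, 3)
        return hand, obj, self.obj_embed(f)

def zero_shot_classify(obj_embedding, class_prototypes):
    """Assign an unseen object to the nearest semantic prototype by cosine similarity."""
    sims = F.cosine_similarity(obj_embedding.unsqueeze(1),
                               class_prototypes.unsqueeze(0), dim=-1)  # (B, C)
    return sims.argmax(dim=1)

class InteractionTemporalModule(nn.Module):
    """Aggregates per-frame hand/object states over the video into an interaction class."""
    def __init__(self, in_dim, hidden=128, n_interactions=45):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_interactions)

    def forward(self, per_frame_states):          # (B, T, in_dim)
        _, (h, _) = self.rnn(per_frame_states)
        return self.cls(h[-1])                    # interaction logits (B, n_interactions)

# Example: run the pipeline on a dummy 8-frame clip; all sizes are placeholders.
video = torch.randn(1, 8, 3, 128, 128)
posenet = PerFramePoseNet()
temporal = InteractionTemporalModule(in_dim=21 * 3 * 2)  # flattened hand + object points
states = []
for t in range(video.shape[1]):
    hand, obj, emb = posenet(video[:, t])
    states.append(torch.cat([hand.flatten(1), obj.flatten(1)], dim=-1))
logits = temporal(torch.stack(states, dim=1))
```

In a hypothetical training setup, the pose heads, the embedding head, and the temporal classifier would be supervised jointly so the whole pipeline remains end-to-end, consistent with the description above.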
