Joint Estimation of 3D Hand Position and Gestures from Monocular Video for Mobile Interaction

We present a machine learning technique that recognizes hand gestures and estimates the metric depth of hands for 3D interaction, relying only on monocular RGB video input. Our goal is to enable spatial interaction with small, body-worn devices, where rich 3D input is desirable but conventional depth sensors are impractical due to their power consumption and size. We propose a hybrid classification-regression approach that learns and predicts a mapping from RGB colors to absolute, metric depth in real time. We additionally classify distinct hand gestures, enabling a variety of 3D interactions. We demonstrate the technique in three mobile interaction scenarios and evaluate it both quantitatively and qualitatively.
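The hybrid classification-regression idea can be illustrated with a small sketch: a classifier first assigns each RGB feature vector to a coarse depth bin, and a per-bin regressor then refines the prediction to a continuous metric depth. The sketch below is a hypothetical illustration under assumed choices (random-forest models, hand-picked bin edges, and a `HybridDepthMapper` class of our own naming); it is not the authors' published pipeline.

```python
# Minimal sketch of a hybrid classification-regression depth mapper.
# All model choices, bin edges, and names here are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


class HybridDepthMapper:
    """Classify RGB features into coarse depth bins, then regress a
    continuous metric depth within the predicted bin."""

    def __init__(self, bin_edges):
        self.bin_edges = np.asarray(bin_edges)  # e.g. [0.1, 0.2, ..., 0.6] metres
        self.classifier = RandomForestClassifier(n_estimators=50)
        self.regressors = {}  # one refinement regressor per depth bin

    def fit(self, rgb_features, depths):
        # Stage 1: discretize ground-truth depths and train the bin classifier.
        bins = np.digitize(depths, self.bin_edges)
        self.classifier.fit(rgb_features, bins)
        # Stage 2: train one regressor per bin on the samples falling in it.
        for b in np.unique(bins):
            mask = bins == b
            reg = RandomForestRegressor(n_estimators=20)
            reg.fit(rgb_features[mask], depths[mask])
            self.regressors[b] = reg

    def predict(self, rgb_features):
        # Coarse bin prediction, then per-bin continuous refinement.
        bins = self.classifier.predict(rgb_features)
        out = np.empty(len(rgb_features))
        for b in np.unique(bins):
            mask = bins == b
            out[mask] = self.regressors[b].predict(rgb_features[mask])
        return out
```

In use, `rgb_features` would be per-pixel color descriptors sampled from hand regions and `depths` the corresponding ground-truth metric depths from a calibrated capture; splitting the problem this way lets the coarse classifier absorb most of the ambiguity in the color-to-depth mapping while each regressor only has to model a narrow depth range.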
