GestARLite: An On-Device Pointing Finger Based Gestural Interface for Smartphones and Video See-Through Head-Mounts

Hand gestures form an intuitive means of interaction in Mixed Reality (MR) applications. However, accurate gesture recognition can be achieved only through state-of-the-art deep learning models or with the use of expensive sensors. Despite the robustness of these deep learning models, they are generally computationally expensive and obtaining real-time performance on-device is still a challenge. To this end, we propose a novel lightweight hand gesture recognition framework that works in First Person View for wearable devices. The models are trained on a GPU machine and ported on an Android smartphone for its use with frugal wearable devices such as the Google Cardboard and VR Box. The proposed hand gesture recognition framework is driven by a cascade of state-of-the-art deep learning models: MobileNetV2 for hand localisation, our custom fingertip regression architecture followed by a Bi-LSTM model for gesture classification. We extensively evaluate the framework on our EgoGestAR dataset. The overall framework works in real-time on mobile devices and achieves a classification accuracy of 80% on EgoGestAR video dataset with an average latency of only 0.12 s.

[1]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[2]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[3]  Nicolas D. Georganas,et al.  Real-Time Hand Gesture Detection and Recognition Using Bag-of-Features and Support Vector Machine Techniques , 2011, IEEE Transactions on Instrumentation and Measurement.

[4]  Tobias Höllerer,et al.  Augmented reality: principles and practice , 2016, SIGGRAPH Courses.

[5]  Luca Benini,et al.  Gesture Recognition in Ego-centric Videos Using Dense Trajectories and Hand Segmentation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[6]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Lovekesh Vig,et al.  An AR Inspection Framework: Feasibility Study with Multiple AR Devices , 2016, 2016 IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct).

[8]  Jürgen Schmidhuber,et al.  Framewise phoneme classification with bidirectional LSTM and other neural network architectures , 2005, Neural Networks.

[9]  Yichao Huang,et al.  DeepFinger: A Cascade Convolutional Neuron Network Approach to Finger Key Point Detection in Egocentric Vision with Mobile Camera , 2015, 2015 IEEE International Conference on Systems, Man, and Cybernetics.

[10]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Nasser Kehtarnavaz,et al.  Comparison of two real-time hand gesture recognition systems involving stereo cameras, depth camera, and inertial sensor , 2014, Photonics Europe.

[12]  Yichao Huang,et al.  A Pointing Gesture Based Egocentric Interaction System: Dataset, Approach and Application , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[13]  Ruslan Salakhutdinov,et al.  Action Recognition using Visual Attention , 2015, NIPS 2015.

[14]  Wolfgang Hürst,et al.  Gesture-based interaction via finger tracking for mobile augmented reality , 2011, Multimedia Tools and Applications.

[15]  Ehtesham Hassan,et al.  GestAR: Real Time Gesture Interaction for AR with Egocentric View , 2016, 2016 IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct).

[16]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Stefan Wermter,et al.  Gesture Recognition with a Convolutional Long Short-Term Memory Recurrent Neural Network , 2016, ESANN.

[18]  Roland Siegwart,et al.  Kinect v2 for mobile robot navigation: Evaluation and modeling , 2015, 2015 International Conference on Advanced Robotics (ICAR).

[19]  Andy Cockburn,et al.  FingARtips: gesture based direct manipulation in Augmented Reality , 2004, GRAPHITE '04.

[20]  Carlo S. Regazzoni,et al.  Hand detection in First Person Vision , 2013, Proceedings of the 16th International Conference on Information Fusion.

[21]  Mark Sandler,et al.  MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Ken Perlin,et al.  Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks , 2014, ACM Trans. Graph..

[23]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[25]  Cheng Li,et al.  Pixel-Level Hand Detection in Ego-centric Videos , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[27]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[28]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.