Continuous body and hand gesture recognition for natural human-computer interaction

Intelligent gesture recognition systems open a new era of natural human-computer interaction. Gesturing is instinctive, a skill we all have, and requires little or no thought, which leaves the focus on the task itself rather than on the interaction modality. We present a new approach to gesture recognition that attends to both body and hands, and interprets gestures continuously from an unsegmented and unbounded input stream. This article describes the whole procedure of continuous body and hand gesture recognition, from signal acquisition through processing to the interpretation of the processed signals. Our system takes a vision-based approach, tracking body and hands using a single stereo camera. Body postures are reconstructed in 3D space using a generative model-based approach with a particle filter, combining both static and dynamic attributes of motion as the input feature to make tracking robust to self-occlusion. The reconstructed body postures guide the search for hands. Hand shapes are classified into one of several canonical hand shapes using an appearance-based approach with a multiclass support vector machine. Finally, the extracted body and hand features are combined and used as the input feature for gesture recognition. We treat the task as an online sequence labeling and segmentation problem. A latent-dynamic conditional random field is used with a temporal sliding window to perform the task continuously. We augment this with a novel technique called multilayered filtering, which performs filtering on both the input layer and the prediction layer. Filtering on the input layer captures long-range temporal dependencies and reduces input signal noise; filtering on the prediction layer takes weighted votes over multiple overlapping prediction results and reduces estimation noise.
We tested our system in a scenario of real-world gestural interaction using the NATOPS dataset, an official vocabulary of aircraft handling gestures. Our experimental results show that (1) using both static and dynamic attributes of motion in body tracking yields a statistically significant improvement in recognition performance over using static attributes alone; and (2) multilayered filtering yields a statistically significant improvement in recognition performance over the same method without filtering. We also show that, on a set of twenty-four NATOPS gestures, our system achieves a recognition accuracy of 75.37%.
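
The multilayered filtering idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian window, the clamped edge handling, the stride-1 sliding window, and all function names (gaussian_kernel, smooth_inputs, vote_predictions) are assumptions made for illustration, and the latent-dynamic conditional random field is replaced by precomputed per-window label probabilities.

```python
import math


def gaussian_kernel(width, sigma):
    """Return a normalized Gaussian window with `width` taps (width should be odd)."""
    half = width // 2
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_inputs(frames, kernel):
    """Input-layer filtering (assumed form): smooth each feature over time.

    `frames` is a list of per-frame feature vectors. Indices that fall
    outside the sequence are clamped to the first or last frame.
    """
    half = len(kernel) // 2
    last = len(frames) - 1
    dims = len(frames[0])
    smoothed = []
    for t in range(len(frames)):
        row = [0.0] * dims
        for k, w in enumerate(kernel):
            src = frames[min(max(t + k - half, 0), last)]
            for d in range(dims):
                row[d] += w * src[d]
        smoothed.append(row)
    return smoothed


def vote_predictions(window_probs, n_frames, weights):
    """Prediction-layer filtering (assumed form): weighted vote per frame.

    `window_probs[i][j]` holds the label probabilities that window i
    (starting at frame i, stride 1) assigns to frame i + j. `weights[j]`
    weights a prediction by its position j inside the window.
    """
    n_labels = len(window_probs[0][0])
    scores = [[0.0] * n_labels for _ in range(n_frames)]
    for start, probs in enumerate(window_probs):
        for offset, frame_probs in enumerate(probs):
            t = start + offset
            if t >= n_frames:
                break
            for label, p in enumerate(frame_probs):
                scores[t][label] += weights[offset] * p
    # Pick the label with the highest accumulated weighted score per frame.
    return [row.index(max(row)) for row in scores]


if __name__ == "__main__":
    # Two overlapping windows of length 3 over 4 frames, 2 gesture labels.
    window_probs = [
        [[0.9, 0.1], [0.9, 0.1], [0.6, 0.4]],  # window covering frames 0..2
        [[0.2, 0.8], [0.1, 0.9], [0.1, 0.9]],  # window covering frames 1..3
    ]
    labels = vote_predictions(window_probs, 4, gaussian_kernel(3, 1.0))
    print(labels)
```

In this sketch, frames covered by several windows receive a label from the combined, position-weighted scores rather than from any single window, which is one plausible way to realize the "weighted votes of multiple overlapping prediction results" mentioned in the abstract.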
