Real-time continuous gesture recognition for natural multimodal interaction

I have developed a real-time continuous gesture recognition system that addresses two important problems that have previously been neglected: (a) smoothly handling two different kinds of gestures, those characterized by distinct paths and those characterized by distinct hand poses; and (b) determining how and when the system should respond to gestures. The novel approaches in this thesis include a probabilistic recognition framework based on a flattened hierarchical hidden Markov model (HHMM) that unifies the recognition of path and pose gestures, and a method of using information from the hidden states in the HMM to identify different gesture phases (the pre-stroke, nucleus, and post-stroke phases), allowing the system to respond appropriately both to gestures that require a discrete response and to those that need a continuous response. The system is extensible: new gestures can be added by recording 3-6 repetitions of the gesture; the system then trains an HMM for the gesture and integrates it into the existing HMM, in a process that takes only a few minutes. My evaluation shows that even with only a small number of training examples (e.g., 6), the system can achieve an average F1 score of 0.805 for the two forms of gestures. To evaluate the performance of my system, I collected a new dataset (the YANG dataset) that includes both path and pose gestures, offering a combination currently lacking in the community and providing the challenge of recognizing different types of gestures mixed together. I also developed a novel hybrid evaluation metric that is more relevant to real-time interaction with different gesture flows.

Thesis Supervisor: Randall Davis
Title: Professor
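
As an illustration only, the idea of using hidden states to identify gesture phases and choose between a discrete and a continuous response could be sketched as follows. This is not the thesis implementation: the names `Phase`, `StateInfo`, `STATE_TABLE`, and `respond`, the two example gestures, and the specific rules (a discrete gesture triggers once when its state leaves the nucleus for the post-stroke, and a continuous gesture emits an update on every nucleus frame) are assumptions made for the sketch. The input is assumed to be a per-frame sequence of already decoded hidden-state indices from a flattened HMM.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    PRE_STROKE = "pre-stroke"
    NUCLEUS = "nucleus"
    POST_STROKE = "post-stroke"


@dataclass(frozen=True)
class StateInfo:
    gesture: str      # gesture this hidden state belongs to
    phase: Phase      # gesture phase this hidden state models
    continuous: bool  # True if the gesture needs a continuous response


# Hypothetical labelling of the flattened HMM's hidden states.
STATE_TABLE = {
    0: StateInfo("swipe", Phase.PRE_STROKE, False),
    1: StateInfo("swipe", Phase.NUCLEUS, False),
    2: StateInfo("swipe", Phase.POST_STROKE, False),
    3: StateInfo("point", Phase.PRE_STROKE, True),
    4: StateInfo("point", Phase.NUCLEUS, True),
    5: StateInfo("point", Phase.POST_STROKE, True),
}


def respond(state_sequence, state_table):
    """Turn per-frame decoded hidden-state indices into response events."""
    events = []
    previous = None
    for frame, state in enumerate(state_sequence):
        info = state_table[state]
        if info.continuous:
            # Assumed rule: continuous gestures respond on every nucleus frame.
            if info.phase is Phase.NUCLEUS:
                events.append((frame, info.gesture, "update"))
        else:
            # Assumed rule: discrete gestures respond once, on the frame
            # where the same gesture moves from nucleus to post-stroke.
            left_nucleus = (
                previous is not None
                and previous.gesture == info.gesture
                and previous.phase is Phase.NUCLEUS
                and info.phase is Phase.POST_STROKE
            )
            if left_nucleus:
                events.append((frame, info.gesture, "trigger"))
        previous = info
    return events


# Example decoded state sequence: one "swipe" followed by one "point".
DECODED = [0, 1, 1, 2, 3, 4, 4, 4, 5]
EVENTS = respond(DECODED, STATE_TABLE)
```

Keeping the phase label on each hidden state means the response logic needs only the decoded state index for the current and previous frame, which suits frame-by-frame, real-time use.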
