Hidden Conditional Random Fields for Gesture Recognition

We introduce a discriminative hidden-state approach for the recognition of human gestures. Gesture sequences often have a complex underlying structure, and models that can incorporate hidden structures have proven to be advantageous for recognition tasks. Most existing approaches to gesture recognition with hidden states employ a Hidden Markov Model or suitable variant (e.g., a factored or coupled state model) to model gesture streams; a significant limitation of these models is the requirement of conditional independence of observations. In addition, hidden states in a generative model are selected to maximize the likelihood of generating all the examples of a given gesture class, which is not necessarily optimal for discriminating the gesture class against other gestures. Previous discriminative approaches to gesture sequence recognition have shown promising results, but have not incorporated hidden states nor addressed the problem of predicting the label of an entire sequence. In this paper, we derive a discriminative sequence model with a hidden state structure, and demonstrate its utility both in a detection and in a multi-way classification formulation. We evaluate our method on the task of recognizing human arm and head gestures, and compare the performance of our method to both generative hidden state and discriminative fully-observable models.

[1]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[2]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[3]  Thad Starner,et al.  Visual Recognition of American Sign Language Using Hidden Markov Models. , 1995 .

[4]  Alex Pentland,et al.  Real-time American Sign Language recognition from video using hidden Markov models , 1995 .

[5]  Alex Pentland,et al.  Coupled hidden Markov models for complex action recognition , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[6]  Kirsti Grobel,et al.  Video-Based Sign Language Recognition Using Hidden Markov Models , 1997, Gesture Workshop.

[7]  Andrew McCallum,et al.  Maximum Entropy Markov Models for Information Extraction and Segmentation , 2000, ICML.

[8]  Ashish Kapoor,et al.  A real-time head nod and shake detector , 2001, PUI '01.

[9]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[10]  Trevor Darrell,et al.  3-D articulated pose tracking for untethered diectic reference , 2002, Proceedings. Fourth IEEE International Conference on Multimodal Interfaces.

[11]  Andrew McCallum,et al.  Toward Conditional Models of Identity Uncertainty with Application to Proper Noun Coreference , 2003, IIWeb.

[12]  Wei Li,et al.  Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons , 2003, CoNLL.

[13]  Martial Hebert,et al.  Discriminative random fields: a discriminative framework for contextual interaction in classification , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[14]  Trevor Darrell,et al.  Adaptive view-based appearance models , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[15]  Paul A. Viola,et al.  Interactive Information Extraction with Constrained Conditional Random Fields , 2004, AAAI.

[16]  Trevor Darrell,et al.  Conditional Random Fields for Object Recognition , 2004, NIPS.

[17]  Antonio Torralba,et al.  Contextual Models for Object Detection Using Boosted Random Fields , 2004, NIPS.

[18]  T. Kobayashi,et al.  A conversation robot using head gesture recognition as para-linguistic information , 2004, RO-MAN 2004. 13th IEEE International Workshop on Robot and Human Interactive Communication (IEEE Catalog No.04TH8759).

[19]  Cristian Sminchisescu,et al.  Conditional models for contextual human motion recognition , 2006, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[20]  Trevor Darrell,et al.  Contextual recognition of head gestures , 2005, ICMI '05.

[21]  Alex Acero,et al.  Hidden conditional random fields for phone classification , 2005, INTERSPEECH.

[22]  Trevor Darrell,et al.  Visual speech recognition with loosely synchronized feature streams , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.