Recognizing Visual Signatures of Spontaneous Head Gestures

Head movements are an integral part of human nonverbal communication. The ability to detect different types of head gestures from video is therefore important for robotic systems that interact with people and for assistive technologies that need to detect conversational gestures to aid communication. To this end, we propose a novel Multi-Scale Deep Convolution-LSTM architecture capable of recognizing the short- and long-term motion patterns found in head gestures, from video data of natural and unconstrained conversations. In particular, our models use Convolutional Neural Networks (CNNs) to learn meaningful representations from short time windows over head motion data. To capture longer-term dependencies, we use Recurrent Neural Networks (RNNs) that extract temporal patterns across the outputs of the CNNs. We compare against classical approaches based on discriminative and generative graphical models and show that our model significantly outperforms these baselines.
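
The "short time windows over head motion data" can be illustrated with a minimal sketch of multi-scale windowing. This is not the paper's implementation: the per-frame feature layout (pitch, yaw, roll), the window lengths, the stride, and the function names are all assumptions chosen for illustration. Each group of windows would be the input to one CNN branch, with a recurrent layer then reading the per-window outputs in temporal order.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical per-frame head pose features: (pitch, yaw, roll).
Frame = Tuple[float, float, float]


def sliding_windows(frames: Sequence[Frame], size: int, stride: int) -> List[List[Frame]]:
    """Return overlapping windows of `size` frames, advancing by `stride` frames."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    # If the sequence is shorter than `size`, the range is empty and no windows result.
    return [list(frames[start:start + size])
            for start in range(0, len(frames) - size + 1, stride)]


def multi_scale_windows(frames: Sequence[Frame],
                        scales: Sequence[int] = (16, 32),  # assumed window lengths
                        stride: int = 8) -> Dict[int, List[List[Frame]]]:
    """Map each window length to its list of windows over the same sequence."""
    return {size: sliding_windows(frames, size, stride) for size in scales}


# Usage: 64 synthetic frames with a slowly changing pitch value.
frames = [(float(i), 0.0, 0.0) for i in range(64)]
windows = multi_scale_windows(frames)
```

Shorter windows expose quick motions such as a single nod, while longer ones cover slower or repeated movements; how the actual model combines the scales is not specified in the text above.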
