Affective interaction recognition using spatio-temporal features and context

Highlights
- A hierarchical representation structure for interaction recognition is introduced.
- Hierarchical coding models are adopted to encode low-level features.
- A segmental clustering method is applied to extract mid-level features.
- Contextual information is incorporated with motion features by extracting interactive contours.
- Empirical results are demonstrated on three datasets.

Abstract
This paper focuses on recognizing human interactions in relation to human emotion and addresses the problem of interaction feature representation. We propose a two-layer feature description structure that hierarchically exploits the representation of spatio-temporal motion features and context features. On the lower layer, local features for motion and for interactive context are extracted separately. We first characterize local spatio-temporal trajectories as motion features. Instead of hand-crafted features, a new hierarchical spatio-temporal trajectory coding model is presented to learn and represent these trajectories. To further exploit the spatial and temporal relationships in interactive activities, we then propose an interactive context descriptor, which extracts local interactive contours from frames; these contours implicitly incorporate contextual spatial and temporal information. On the higher layer, semi-global features are built from the local features encoded on the lower layer, using a spatio-temporal segment clustering method designed for feature extraction on this layer. This method takes the spatial relationships and temporal order of local features into account and produces mid-level motion features and mid-level context features. Experiments are conducted on three challenging video action datasets: HMDB51, Hollywood2, and UT-Interaction. The results demonstrate the efficacy of the proposed structure and validate the effectiveness of the proposed context descriptor.
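To make the two-layer structure concrete, the following is a minimal Python sketch of how such a pipeline could be wired together. Everything here is illustrative rather than the authors' implementation: the trajectory and contour features are random stand-ins, plain k-means stands in for both the hierarchical trajectory coding model and the spatio-temporal segment clustering, and names such as encode_hierarchically are hypothetical.

```python
# Illustrative sketch of the two-layer feature pipeline (not the paper's code).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# --- Lower layer: local features ----------------------------------------
# Stand-ins for local spatio-temporal trajectories (motion) and interactive
# contours (context), one row per local video segment.
n_segments, traj_dim, contour_dim = 200, 64, 32
local_motion = rng.standard_normal((n_segments, traj_dim))      # trajectory features
local_context = rng.standard_normal((n_segments, contour_dim))  # contour features

def encode_hierarchically(features, codebook_sizes=(128, 32)):
    """Toy two-level coding: repeatedly quantize features against a learned
    codebook. A crude stand-in for the hierarchical trajectory coding model."""
    codes = features
    for k in codebook_sizes:
        km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(codes)
        codes = km.cluster_centers_[km.labels_]  # replace each row by its centroid
    return codes

motion_codes = encode_hierarchically(local_motion)
context_codes = encode_hierarchically(local_context)

# --- Higher layer: mid-level features via segment clustering ------------
# Concatenate the encoded motion and context features, append a normalized
# time index so clustering respects temporal order, then pool each cluster
# into one mid-level feature vector.
t = np.linspace(0.0, 1.0, n_segments)[:, None]
spatio_temporal = np.hstack([motion_codes, context_codes, t])
segment_labels = KMeans(n_clusters=10, n_init=5, random_state=0).fit_predict(spatio_temporal)

mid_level = np.vstack([
    spatio_temporal[segment_labels == c].mean(axis=0)
    for c in np.unique(segment_labels)
])
print("mid-level feature matrix:", mid_level.shape)  # (10 clusters, feature dim)
```

In this sketch, each pooled cluster is one mid-level feature combining motion, context, and temporal position; a classifier over these vectors would complete the recognition pipeline.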
