Joint label-interaction learning for human action recognition

Human interactions and their action categories exhibit strong correlations, so identifying the interaction configuration is important for improving action recognition. However, interactions are typically estimated with heuristics or treated as latent variables: the former often yields incorrect interaction configurations, while the latter introduces a challenging training problem. We therefore propose a framework that jointly learns interactions and actions by designing a potential function that combines features learned with deep neural networks and human interaction context. We further propose an iterative approach that solves the associated inference problem approximately and efficiently. Experimental results on real datasets demonstrate that the proposed approach outperforms baselines by a large margin and is competitive with the state of the art.
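
To make the joint-learning idea concrete, below is a minimal, illustrative Python sketch of one way such iterative inference could be organised: a joint potential sums per-person action scores (standing in for deep-network features) and pairwise terms that depend on an interaction label, and inference alternates between updating interaction labels and action labels. All names, tensor shapes, the random potentials, and the coordinate-ascent scheme are assumptions for illustration, not the paper's actual formulation.

```python
# Illustrative sketch only: joint inference over action labels and pairwise
# interaction labels by iterative coordinate ascent.  The potentials below are
# random placeholders for CNN-feature and interaction-context terms.
import numpy as np

rng = np.random.default_rng(0)

n_people, n_actions, n_inter = 3, 4, 2   # people, action classes, interaction states

# Hypothetical potentials: unary (deep-feature) action scores and
# pairwise scores indexed by (action_i, action_j, interaction_ij).
unary = rng.normal(size=(n_people, n_actions))
pairwise = rng.normal(size=(n_actions, n_actions, n_inter))

pairs = [(i, j) for i in range(n_people) for j in range(i + 1, n_people)]

def score(actions, inter):
    """Total potential of a joint labelling (higher is better)."""
    s = sum(unary[i, actions[i]] for i in range(n_people))
    s += sum(pairwise[actions[i], actions[j], inter[(i, j)]] for i, j in pairs)
    return s

# Initialise actions from the unary scores alone; interactions arbitrarily.
actions = unary.argmax(axis=1)
inter = {p: 0 for p in pairs}

for _ in range(10):                       # alternate until (approximate) convergence
    # Step 1: with actions fixed, pick the best interaction label per pair.
    for i, j in pairs:
        inter[(i, j)] = int(pairwise[actions[i], actions[j]].argmax())
    # Step 2: with interactions fixed, greedily update each person's action.
    for i in range(n_people):
        cand = unary[i].copy()
        for a in range(n_actions):
            trial = actions.copy()
            trial[i] = a
            cand[a] += sum(pairwise[trial[p], trial[q], inter[(p, q)]]
                           for p, q in pairs if i in (p, q))
        actions[i] = int(cand.argmax())

print("actions:", actions, "interactions:", inter, "score:", score(actions, inter))
```

This alternating scheme is only one plausible instance of "efficient, approximate" inference over a joint action-interaction labelling; the paper's own procedure may differ in both the potential function and the update order.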
