Motion Part Regularization: Improving action recognition via trajectory group selection

Dense local trajectories have been successfully used in action recognition. However, for most actions only a few local motion features (e.g., critical movement of hand, arm, leg etc.) are responsible for the action label. Therefore, highlighting the local features which are associated with important motion parts will lead to a more discriminative action representation. Inspired by recent advances in sentence regularization for text classification, we introduce a Motion Part Regularization framework to mine for discriminative groups of dense trajectories which form important motion parts. First, motion part candidates are generated by spatio-temporal grouping of densely extracted trajectories. Second, an objective function which encourages sparse selection for these trajectory groups is formulated together with an action class discriminative term. Then, we propose an alternative optimization algorithm to efficiently solve this objective function by introducing a set of auxiliary variables which correspond to the discriminativeness weights of each motion part (trajectory group). These learned motion part weights are further utilized to form a discriminativeness weighted Fisher vector representation for each action sample for final classification. The proposed motion part regularization framework achieves the state-of-the-art performances on several action recognition benchmarks.

[1]  Chong-Wah Ngo,et al.  Trajectory-Based Modeling of Human Actions with Motion Reference Points , 2012, ECCV.

[2]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[3]  Yu Qiao,et al.  Action Recognition with Stacked Fisher Vectors , 2014, ECCV.

[4]  Alexei A. Efros,et al.  Ensemble of exemplar-SVMs for object detection and beyond , 2011, 2011 International Conference on Computer Vision.

[5]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[6]  Zhuowen Tu,et al.  Action Recognition with Actons , 2013, 2013 IEEE International Conference on Computer Vision.

[7]  Limin Wang,et al.  Latent Hierarchical Model of Temporal Structure for Complex Activity Classification , 2014, IEEE Transactions on Image Processing.

[8]  Fei-Fei Li,et al.  Learning latent temporal structure for complex event detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[10]  Limin Wang,et al.  Mining Motion Atoms and Phrases for Complex Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[11]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[12]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Yunde Jia,et al.  Interactive Phrases: Semantic Descriptionsfor Human Interaction Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Patrick Bouthemy,et al.  Better Exploiting Motion for Better Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[16]  Jake K. Aggarwal,et al.  Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[17]  Meng Wang,et al.  Unified Video Annotation via Multigraph Learning , 2009, IEEE Transactions on Circuits and Systems for Video Technology.

[18]  Cordelia Schmid,et al.  Actions in context , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Bingbing Ni,et al.  Pose Adaptive Motion Feature Pooling for Human Action Analysis , 2014, International Journal of Computer Vision.

[20]  Martial Hebert,et al.  Modeling the Temporal Extent of Actions , 2010, ECCV.

[21]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[22]  Li Wang,et al.  Human Action Segmentation and Recognition Using Discriminative Semi-Markov Models , 2011, International Journal of Computer Vision.

[23]  Yang Wang,et al.  Hidden Part Models for Human Action Recognition: Probabilistic versus Max Margin , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Meng Wang,et al.  Beyond Distance Measurement: Constructing Neighborhood Similarity for Video Annotation , 2009, IEEE Transactions on Multimedia.

[25]  Juan Carlos Niebles,et al.  Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification , 2010, ECCV.

[26]  Iasonas Kokkinos,et al.  Discovering discriminative action parts from mid-level video representations , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Jean Ponce,et al.  Automatic annotation of human actions in video , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[29]  Noah A. Smith,et al.  Making the Most of Bag of Words: Sentence Regularization with Alternating Direction Method of Multipliers , 2014, ICML.

[30]  Li Wang,et al.  Discriminative human action segmentation and recognition using semi-Markov model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[32]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[33]  M. Yuan,et al.  Model selection and estimation in regression with grouped variables , 2006 .

[34]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[35]  Ying Wu,et al.  Discriminative Video Pattern Search for Efficient Action Detection , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.