Hierarchical Clustering Multi-Task Learning for Joint Human Action Grouping and Recognition

This paper proposes a hierarchical clustering multi-task learning (HC-MTL) method for joint human action grouping and recognition. Specifically, we formulate the objective function as a group-wise least-squares loss regularized by low-rank and sparsity terms with respect to two latent variables, the model parameters and the grouping information, which are optimized jointly. To handle this non-convex optimization, we decompose it into two sub-tasks: multi-task learning and task relatedness discovery. First, we convert the non-convex objective function into a convex formulation by fixing the latent grouping information. This new objective function focuses on multi-task learning by strengthening the relationships shared across actions and by learning action-specific features. Second, we use the learned model parameters to measure task relatedness and to cluster the tasks. By alternating between these two steps iteratively, HC-MTL obtains both the optimal action models and the action groups. The proposed method is validated on three kinds of challenging datasets: six realistic action datasets (Hollywood2, YouTube, UCF Sports, UCF50, HMDB51 and UCF101), two constrained datasets (KTH and TJU), and two multi-view datasets (MV-TJU and IXMAS).
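The alternation between the two sub-tasks can be illustrated with a small NumPy sketch. It is not the authors' implementation: treating each action as one least-squares regression task (features `X_t`, targets `y_t`), splitting the weights into a low-rank shared part `P` and a sparse action-specific part `Q`, solving the convex step with proximal gradient descent, discovering relatedness with spectral clustering on cosine affinities, fixing the number of groups `k`, and all function names (`fit_group`, `cluster_tasks`, `hc_mtl_sketch`) are assumptions made for illustration.

```python
import numpy as np


def soft_threshold(X, t):
    """Proximal operator of t * ||X||_1: element-wise shrinkage toward zero."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)


def svd_threshold(X, t):
    """Proximal operator of t * ||X||_*: shrink singular values toward zero."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt


def fit_group(Xs, ys, lam_lr, lam_sp, iters=300):
    """Convex step for one fixed group (assumed form): W = P + Q, with P low-rank
    (shared across actions) and Q sparse (action-specific). Minimizes
    sum_t 0.5*||X_t (p_t + q_t) - y_t||^2 + lam_lr*||P||_* + lam_sp*||Q||_1
    by proximal gradient descent (ISTA) with a fixed step size."""
    d, T = Xs[0].shape[1], len(Xs)
    P, Q = np.zeros((d, T)), np.zeros((d, T))
    # Lipschitz constant of the smooth loss in (P, Q) is 2 * max_t ||X_t||_2^2
    step = 1.0 / (2.0 * max(np.linalg.norm(X, 2) ** 2 for X in Xs))
    for _ in range(iters):
        # Column t of G is the least-squares gradient of task t
        G = np.column_stack([X.T @ (X @ (P[:, t] + Q[:, t]) - y)
                             for t, (X, y) in enumerate(zip(Xs, ys))])
        P = svd_threshold(P - step * G, step * lam_lr)
        Q = soft_threshold(Q - step * G, step * lam_sp)
    return P + Q


def cluster_tasks(W, k, iters=50):
    """Relatedness step (assumed): spectral clustering of tasks using the
    absolute cosine similarity between their learned weight vectors."""
    Wn = W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-12)
    A = np.abs(Wn.T @ Wn)
    np.fill_diagonal(A, 0.0)
    d_inv = 1.0 / np.sqrt(A.sum(axis=1) + 1e-12)
    M = d_inv[:, None] * A * d_inv[None, :]  # D^-1/2 A D^-1/2
    _, vecs = np.linalg.eigh(M)
    U = vecs[:, -k:]  # eigenvectors of the k largest eigenvalues
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    # k-means on the rows of U with deterministic farthest-point seeding
    C = [U[0]]
    for _ in range(1, k):
        dist = np.min([((U - c) ** 2).sum(axis=1) for c in C], axis=0)
        C.append(U[np.argmax(dist)])
    C = np.array(C)
    for _ in range(iters):
        labels = np.argmin(((U[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = U[labels == j].mean(axis=0)
    return labels


def hc_mtl_sketch(Xs, ys, k, lam_lr=0.1, lam_sp=0.05, outer=5):
    """Alternate between the convex group-wise MTL step and task clustering."""
    T, d = len(Xs), Xs[0].shape[1]
    labels = np.zeros(T, dtype=int)  # start with all actions in one group
    for _ in range(outer):
        W = np.zeros((d, T))
        for g in np.unique(labels):
            idx = np.flatnonzero(labels == g)
            W[:, idx] = fit_group([Xs[i] for i in idx], [ys[i] for i in idx],
                                  lam_lr, lam_sp)
        new_labels = cluster_tasks(W, k)
        if np.array_equal(new_labels, labels):
            break  # grouping unchanged: stop alternating
        labels = new_labels
    return W, labels
```

A call such as `W, labels = hc_mtl_sketch(Xs, ys, k=2)` takes lists of per-action feature matrices and target vectors and returns a d-by-T weight matrix with one group label per action. Fixing the groups makes each inner problem convex, which is why a plain proximal-gradient loop suffices there; the clustering step, and therefore the whole alternation, carries no global optimality guarantee.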
The extensive experimental results show that: 1) HC-MTL achieves performance competitive with the state of the art for action recognition and grouping; 2) HC-MTL overcomes the difficulty of heuristic action grouping based solely on human knowledge; 3) HC-MTL avoids the possible inconsistency between subjective action grouping, which depends on human knowledge, and objective action grouping based on the feature subspace distributions of multiple actions. Comparison with the popular clustered multi-task learning further reveals that the latent relatedness discovered by HC-MTL helps induce group-wise multi-task learning and boosts performance. To the best of our knowledge, ours is the first work that breaks the assumption that all actions are either independent for individual learning or correlated for joint modeling, and that proposes HC-MTL for automated, joint action grouping and modeling.
